1. Business Problem Statement and Objectives
The client, a major beverage supplier, needs a structured system to optimize logistics between its own fleet of Red Trucks and alternative delivery methods (ARTM), which include partner trucks and third-party carriers known as White Trucks. Red Trucks enhance customer relationships and contribute to revenue, while ARTM offers flexibility but limits interaction and control.
To ensure high-quality service and cost efficiency, I will establish clear fleet allocation guidelines based on customer profiles, transaction data, addresses, and delivery costs. This approach will determine the optimal truck type for each customer using a well-defined annual volume threshold. Additionally, customer segmentation will identify shared characteristics, enabling more strategic and data-driven decision-making.
Based on these insights, I will provide actionable recommendations to optimize fleet allocation and enhance operational efficiency.
2. Analytical Approach and Deliverables
The analysis will be conducted separately for two customer groups:
All Customers – The broader customer base, including those who purchase various product types.
Local Market Partners Buying Fountain Only – Customers who purchase only fountain drinks, excluding CO2, cans, or bottles.
To address logistics challenges and transform decisions into data-driven solutions, the approach will combine predictive models with clustering techniques, using both supervised and unsupervised learning methods to build a structured and efficient logistics framework.
Supervised Learning
Supervised learning techniques will be applied to determine whether each customer should be served by Red Trucks (our own fleet) or White Trucks (ARTM), based on a defined set of criteria.
Refining the Fleet Allocation Strategy: Initial Assumptions
As the dataset does not provide predefined fleet allocation criteria, I will establish initial reference points to guide this analysis:
Annual Volume Threshold: Customers receiving at least 400 cases and/or gallons per year will initially be assigned to the Red Truck fleet, while those below this threshold will be served by White Trucks.
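As a quick illustration of this initial rule, the sketch below implements the 400-unit cutoff as a small base-R function. The function name and inputs are illustrative only, not part of the client data:

```r
# Hedged sketch of the initial allocation rule: 400 or more total units
# (cases + gallons) per year -> Red Truck, otherwise White Truck.
assign_fleet <- function(annual_cases, annual_gallons, threshold = 400) {
  total <- annual_cases + annual_gallons
  ifelse(total >= threshold, "RED TRUCK", "WHITE TRUCK")
}

# Vectorized over customers: totals of 450 and 170 units
assign_fleet(c(350, 120), c(100, 50))  # "RED TRUCK" "WHITE TRUCK"
```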
Unsupervised Learning - Customer Segmentation
A clustering analysis will identify customer groups with similar consumption patterns, refining fleet allocation and enhancing decision-making rules.
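For orientation, a segmentation of this kind can be sketched with base R's kmeans on standardized features. The features below are synthetic stand-ins; the real clustering inputs will come from the customer-level dataset built later:

```r
# Illustrative k-means sketch on synthetic, standardized features
# (assumption: two numeric features per customer, three segments).
set.seed(42)
features <- scale(matrix(rnorm(200), ncol = 2))  # 100 "customers", 2 features
km <- kmeans(features, centers = 3, nstart = 10)

# Cluster sizes give a first feel for the segmentation
table(km$cluster)
```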
Cost Impact Analysis
Considering that a white-truck delivery costs roughly one fifth of a red-truck delivery, different logistics scenarios will be analyzed to compare the costs and strategic impacts of the current approach with the final recommended strategy.
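A minimal sketch of how such scenarios can be compared, assuming only the stated 5:1 cost ratio; the per-delivery unit cost itself is hypothetical:

```r
# Hypothetical unit cost for a red-truck delivery; white trucks are
# assumed to cost one fifth of that, per the stated ratio.
red_cost_per_delivery   <- 50
white_cost_per_delivery <- red_cost_per_delivery / 5

scenario_cost <- function(n_red, n_white) {
  n_red * red_cost_per_delivery + n_white * white_cost_per_delivery
}

# All deliveries by red trucks vs. a mixed allocation
scenario_cost(1000, 0)   # 50000
scenario_cost(300, 700)  # 22000
```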
Recommendations
Based on the analysis, data-driven recommendations will be provided to optimize fleet allocation, with a focus on improving service quality, cost efficiency, and strategic decision-making. This approach ensures that each customer receives the most suitable delivery method.
Description of the Data
This project will use four data files provided by the company:
customer_profile.csv
transactional_data.csv
customer_address_and_zip_mapping.csv
delivery_cost_data.xlsx
The customer profile data includes information on all customers they deliver to. This file contains each customer’s unique ID along with various categorical variables that describe their location, industry, and delivery preferences.
The transactional data contains all transactions from all customers with the ordered and delivered amount of product measured in cases and gallons.
The customer address file only contains two columns – zip code and full address. This can be used in tandem with the customer profile data.
The delivery cost data maps the cost of delivering a product based on different criteria. This will be used with the transaction data to find the cost of each transaction.
3. Exploratory Data Analysis (EDA) - Part I
This section analyzes the provided data to identify solutions, with a focus on completeness, consistency, and potential issues. Data transformations may include the creation of new variables to improve model accuracy. Given the large number of variables, the most relevant ones will be prioritized to ensure clarity, while less relevant analyses will be excluded to avoid information overload.
Missing data assessments and any substitutions or modifications will be carried out and will be included in the provided R Markdown file. However, some of these actions will not be displayed in this report to avoid content overload.
Profile Dataset - Cleaning and Adjustments
The number of unique CUSTOMER_NUMBER in profile_data is greater than in transactional data. This will be addressed later before merging the datasets.
There are no duplicates or missing values for CUSTOMER_NUMBER.
Date variables were adjusted to the proper format.
Logical variables were converted to integers, where 0 represents false and 1 represents true.
Special characters and extra spaces were removed in factor variables.
Missing values in the PRIMARY_GROUP_NUMBER field were replaced with zero.
The CHAIN_MEMBER variable was created to indicate whether the outlet belongs to a chain (has a PRIMARY_GROUP_NUMBER). A value of 1 represents a member, and 0 represents a non-member.
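The two profile adjustments above (zero-filling PRIMARY_GROUP_NUMBER and deriving CHAIN_MEMBER) can be sketched in base R on toy data; the real field lives in customer_profile.csv:

```r
# Toy profile data standing in for customer_profile.csv
profile <- data.frame(
  CUSTOMER_NUMBER      = 1:4,
  PRIMARY_GROUP_NUMBER = c(12, NA, 7, NA)
)

# Replace missing group numbers with zero, then flag chain membership
profile$PRIMARY_GROUP_NUMBER[is.na(profile$PRIMARY_GROUP_NUMBER)] <- 0
profile$CHAIN_MEMBER <- ifelse(profile$PRIMARY_GROUP_NUMBER > 0, 1L, 0L)

profile$CHAIN_MEMBER  # 1 0 1 0
```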
Customer Address Dataset - Cleaning and Adjustments
The address was split into new columns for each component.
The dataset does not contain customers’ actual addresses but will be used for data aggregation to support customer segmentation. It includes 145 rows with identical geographic coordinates; however, no ZIP codes are duplicated.
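One way to split a combined address string into components with base R is sketched below; the separator and field order are assumptions for illustration, not the file's documented layout:

```r
# Toy address data; the real file has ZIP plus one combined address string
addr <- data.frame(
  ZIP = c("30301", "30302"),
  full_address = c("30301,Atlanta,GA,Fulton,33.75,-84.39",
                   "30302,Atlanta,GA,Fulton,33.76,-84.40"),
  stringsAsFactors = FALSE
)

# Split on the assumed comma separator and spread into new columns
parts <- do.call(rbind, strsplit(addr$full_address, ",", fixed = TRUE))
addr$CITY      <- parts[, 2]
addr$STATE     <- parts[, 3]
addr$COUNTY    <- parts[, 4]
addr$LATITUDE  <- as.numeric(parts[, 5])
addr$LONGITUDE <- as.numeric(parts[, 6])
```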
Transactional Dataset - Cleaning and Adjustments
11,131 null values in the ORDER_TYPE column were replaced with “OTHER.”
The DAYS_AFTER column was added to track the number of days since the transaction, up to February 2, 2025.
483 rows with zero values in ORDERED, LOADED, and DELIVERED CASES and GALLONS will be removed from the dataset.
Negative values in DELIVERED_CASES and DELIVERED_GALLONS have been moved to new columns (RETURNED_CASES and RETURNED_GALLONS), and the original columns were set to zero.
30,965 transactions are related to order and/or load but do not have delivery or return data. These will be classified as “order_load” in the DLV_TYPE column.
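The negative-value adjustment described above can be sketched in base R on toy data (the real columns live in transactional_data.csv):

```r
# Toy transactions with a negative delivered case and gallon value
tx <- data.frame(DELIVERED_CASES   = c(10, -3, 0),
                 DELIVERED_GALLONS = c(0, 5, -2))

# Move negatives into RETURNED_* columns, then zero them out in DELIVERED_*
tx$RETURNED_CASES    <- ifelse(tx$DELIVERED_CASES   < 0, -tx$DELIVERED_CASES,   0)
tx$RETURNED_GALLONS  <- ifelse(tx$DELIVERED_GALLONS < 0, -tx$DELIVERED_GALLONS, 0)
tx$DELIVERED_CASES   <- pmax(tx$DELIVERED_CASES,   0)
tx$DELIVERED_GALLONS <- pmax(tx$DELIVERED_GALLONS, 0)
```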
3.2 Combined Dataset Driven by Transactions
During the exploration, combining all available data was identified as the most effective approach for subsequent analyses. Two files were created: one preserving individual transactions and another compiling information by customer. Both will be used in the exploratory data analysis.
The profile data contains exactly 1,801 unique ZIP codes, which were merged with the same number of unique ZIP codes from the customer address dataset. It is important to note that some ZIP codes share the same geographic coordinates, reducing reliability in those cases.
As previously mentioned, the number of unique customer numbers in the profile data (now referred to as full data) is greater than in the transactions dataset. Only customers present in the transactions dataset were included in the merged data.
Code
# Merge customer_address with profile_data using ZIP_CODE
full_data <- profile_data %>%
  left_join(customer_address, by = c("ZIP_CODE" = "ZIP"))

# Check the number of unique CUSTOMER_NUMBER in full_data and op_data
print(length(unique(full_data$CUSTOMER_NUMBER)))
print(length(unique(op_data$CUSTOMER_NUMBER)))

# Keep only CUSTOMER_NUMBERs that are also in op_data, then merge with op_data
full_data <- full_data %>%
  filter(CUSTOMER_NUMBER %in% op_data$CUSTOMER_NUMBER) %>%
  left_join(op_data, by = "CUSTOMER_NUMBER")
Below are the first 5 rows and 6 columns of the combined dataset.
Code
# Display the first few rows of the combined dataset
head(full_data[, 1:6], 5)
The variable LOCAL_FOUNT_ONLY will be created to identify whether the transaction’s customer belongs to the “Local Market Partners Buying Fountain Only” group—customers who purchase only fountain drinks, excluding CO2, cans, or bottles. It will be assigned a value of 1 if the customer belongs to this group and 0 otherwise.
Code
# Aggregate total delivered cases and gallons per customer
customer_summary <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(TOTAL_DELIVERED_CASES   = sum(DELIVERED_CASES),
            TOTAL_DELIVERED_GALLONS = sum(DELIVERED_GALLONS),
            LOCAL_MARKET_PARTNER    = max(LOCAL_MARKET_PARTNER),
            CO2_CUSTOMER            = max(CO2_CUSTOMER),
            .groups = "drop")

# Classify customers based on aggregated values
customer_summary <- customer_summary %>%
  mutate(LOCAL_FOUNT_ONLY = case_when(
    LOCAL_MARKET_PARTNER == 1 & CO2_CUSTOMER == 0 &
      TOTAL_DELIVERED_GALLONS > 0 & TOTAL_DELIVERED_CASES == 0 ~ 1L,
    TRUE ~ 0L
  ))

# Merge back to original data
full_data <- full_data %>%
  left_join(dplyr::select(customer_summary, CUSTOMER_NUMBER, LOCAL_FOUNT_ONLY),
            by = "CUSTOMER_NUMBER")

# Remove temporary data frames
rm(customer_summary)
The code below will create a table for an initial overview of the customer types.
Local Market Partners Fountain Only (LFO) - Delivery Quantities Overview
| LFO | customers | pct_cust | transactions | pct_trans | qtd_cas | qtd_gal | pct_gal | total_qtd | pct_qtd |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 28,961 | 95.5 | 1,013,652 | 97 | 26,434,079 | 9,086,878 | 94.1 | 35,520,957 | 98.4 |
| 1 | 1,359 | 4.5 | 31,405 | 3 | 0 | 573,314 | 5.9 | 573,314 | 1.6 |
| Total | 30,320 | 100.0 | 1,045,057 | 100 | 26,434,079 | 9,660,192 | 100.0 | 36,094,271 | 100.0 |
Code
# Remove temporary data frames
rm(summary_data, total_row, combined_data)
Only 4.5% of customers are Local Market Partners who do not purchase CO2 and buy only fountain drinks (LFO = 1), accounting for 3% of transactions. They consumed 5.9% of delivered gallons but represent just 1.6% of the total volume (cases + gallons).
This small group of 1,359 customers includes 83 transactions with positive ordered cases. The last such order was placed on December 19, 2024, early enough for case deliveries to have appeared in the transactions. Since none did, these customers will remain classified in the LFO group: they consume fountain drinks (gallons) despite occasionally ordering cases.
3.3 Combined Dataset Driven by Outlets
The information from the combined transaction dataset (full_data) will now be merged by customer and named full_data_customer. The goal is to create a unique list of customers who have made transactions. This file will contain a large number of columns and will be used for further analysis.
Code
# Creating the YEAR_MONTH column to identify the periods
full_data <- full_data %>%
  mutate(YEAR_MONTH = format(as.Date(TRANSACTION_DATE), "%Y_%m"))

# Function to count transactions by period
count_transactions <- function(df, value_column, prefix) {
  df %>%
    group_by(CUSTOMER_NUMBER, YEAR_MONTH) %>%
    summarise(value_count = sum(!!sym(value_column) > 0, na.rm = TRUE),
              .groups = "drop") %>%
    pivot_wider(names_from = YEAR_MONTH, values_from = value_count,
                names_prefix = prefix, values_fill = list(value_count = 0))
}

# Counting transactions for each metric
trans_ordered_cases     <- count_transactions(full_data, "ORDERED_CASES", "TRANS_ORD_CA_")
trans_ordered_gallons   <- count_transactions(full_data, "ORDERED_GALLONS", "TRANS_ORD_GAL_")
trans_delivered_cases   <- count_transactions(full_data, "DELIVERED_CASES", "TRANS_DLV_CA_")
trans_delivered_gallons <- count_transactions(full_data, "DELIVERED_GALLONS", "TRANS_DLV_GAL_")
trans_returned_cases    <- count_transactions(full_data, "RETURNED_CASES", "TRANS_RET_CA_")
trans_returned_gallons  <- count_transactions(full_data, "RETURNED_GALLONS", "TRANS_RET_GAL_")

# Function to sum the values by period
sum_transactions <- function(df, value_column, prefix) {
  df %>%
    group_by(CUSTOMER_NUMBER, YEAR_MONTH) %>%
    summarise(value_sum = sum(!!sym(value_column), na.rm = TRUE),
              .groups = "drop") %>%
    pivot_wider(names_from = YEAR_MONTH, values_from = value_sum,
                names_prefix = prefix, values_fill = list(value_sum = 0))
}

# Summing transactions for each metric
qtd_ordered_cases     <- sum_transactions(full_data, "ORDERED_CASES", "QTD_ORD_CA_")
qtd_ordered_gallons   <- sum_transactions(full_data, "ORDERED_GALLONS", "QTD_ORD_GAL_")
qtd_delivered_cases   <- sum_transactions(full_data, "DELIVERED_CASES", "QTD_DLV_CA_")
qtd_delivered_gallons <- sum_transactions(full_data, "DELIVERED_GALLONS", "QTD_DLV_GAL_")
qtd_returned_cases    <- sum_transactions(full_data, "RETURNED_CASES", "QTD_RET_CA_")
qtd_returned_gallons  <- sum_transactions(full_data, "RETURNED_GALLONS", "QTD_RET_GAL_")

# Ensure the columns in column_order are present in full_data
column_order <- c("CUSTOMER_NUMBER", "PRIMARY_GROUP_NUMBER", "FREQUENT_ORDER_TYPE",
                  "FIRST_DELIVERY_DATE", "ON_BOARDING_DATE", "LOCAL_FOUNT_ONLY",
                  "COLD_DRINK_CHANNEL", "TRADE_CHANNEL", "SUB_TRADE_CHANNEL",
                  "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "ZIP_CODE", "CHAIN_MEMBER",
                  "CITY", "STATE", "COUNTY", "REGION", "LATITUDE", "LONGITUDE")
missing_cols <- setdiff(column_order, colnames(full_data))
if (length(missing_cols) > 0) {
  stop("The following columns are missing in full_data: ",
       paste(missing_cols, collapse = ", "))
}

# Count the number of transactions per customer
trans_count <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(TRANSACTIONS_DATE_COUNT = n(), .groups = "drop")

# Join the data with the required columns in the desired order
full_data_customer <- distinct(full_data[, column_order]) %>%
  left_join(trans_count, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_ordered_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_ordered_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_delivered_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_delivered_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_returned_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_returned_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_ordered_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_ordered_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_delivered_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_delivered_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_returned_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_returned_gallons, by = "CUSTOMER_NUMBER")

# Rename order types so they make valid column names after pivoting
full_data <- full_data %>%
  mutate(ORDER_TYPE = dplyr::recode(ORDER_TYPE,
                                    "CALL CENTER"   = "CALL.CENTER",
                                    "MYCOKE LEGACY" = "MYCOKE.LEGACY",
                                    "SALES REP"     = "SALES.REP"))

# Count transactions by ORDER_TYPE
order_type_count <- full_data %>%
  group_by(CUSTOMER_NUMBER, ORDER_TYPE) %>%
  summarise(order_type_count = n(), .groups = "drop") %>%
  pivot_wider(names_from = ORDER_TYPE, values_from = order_type_count,
              names_prefix = "OT_", values_fill = list(order_type_count = 0))

# Count transactions by DLV_TYPE
dlv_type_count <- full_data %>%
  group_by(CUSTOMER_NUMBER, DLV_TYPE) %>%
  summarise(dlv_type_count = n(), .groups = "drop") %>%
  pivot_wider(names_from = DLV_TYPE, values_from = dlv_type_count,
              names_prefix = "DLVT_", values_fill = list(dlv_type_count = 0))

# Add the ORDER_TYPE and DLV_TYPE columns to full_data_customer
full_data_customer <- full_data_customer %>%
  left_join(order_type_count, by = "CUSTOMER_NUMBER") %>%
  left_join(dlv_type_count, by = "CUSTOMER_NUMBER")

# Add the summary columns
full_data_customer <- full_data_customer %>%
  mutate(
    TOTAL_CASES_ORDERED     = rowSums(full_data_customer[, grep("^QTD_ORD_CA_", names(full_data_customer))]),
    TOTAL_CASES_DELIVERED   = rowSums(full_data_customer[, grep("^QTD_DLV_CA_", names(full_data_customer))]),
    TOTAL_GALLONS_ORDERED   = rowSums(full_data_customer[, grep("^QTD_ORD_GAL_", names(full_data_customer))]),
    TOTAL_GALLONS_DELIVERED = rowSums(full_data_customer[, grep("^QTD_DLV_GAL_", names(full_data_customer))]),
    TOTAL_CASES_RETURNED    = rowSums(full_data_customer[, grep("^QTD_RET_CA_", names(full_data_customer))]),
    TOTAL_GALLONS_RETURNED  = rowSums(full_data_customer[, grep("^QTD_RET_GAL_", names(full_data_customer))])
  )

# Ensure column order
ot_columns      <- colnames(order_type_count)[-1]
dlvt_columns    <- colnames(dlv_type_count)[-1]
summary_columns <- c("TOTAL_CASES_ORDERED", "TOTAL_CASES_DELIVERED",
                     "TOTAL_GALLONS_ORDERED", "TOTAL_GALLONS_DELIVERED",
                     "TOTAL_CASES_RETURNED", "TOTAL_GALLONS_RETURNED")
transaction_columns <- grep("^TRANS_", colnames(full_data_customer), value = TRUE)
quantity_columns    <- grep("^QTD_", colnames(full_data_customer), value = TRUE)
ordered_columns <- c(column_order, "TRANSACTIONS_DATE_COUNT", ot_columns, dlvt_columns,
                     summary_columns, sort(transaction_columns), sort(quantity_columns))
full_data_customer <- full_data_customer[, ordered_columns]

# Replace NAs with 0 in transaction and quantity columns
full_data_customer[is.na(full_data_customer)] <- 0

# Extra variables: days since first delivery and since onboarding
ref_date <- as.Date("2025-02-01")
full_data_customer$DAYS_FIRST_DLV  <- as.numeric(difftime(ref_date, full_data_customer$FIRST_DELIVERY_DATE, units = "days"))
full_data_customer$DAYS_ONBOARDING <- as.numeric(difftime(ref_date, full_data_customer$ON_BOARDING_DATE, units = "days"))

# Average transactions per month for each metric
for (metric in c("TRANS_DLV_CA", "TRANS_DLV_GAL", "TRANS_ORD_CA",
                 "TRANS_ORD_GAL", "TRANS_RET_CA", "TRANS_RET_GAL")) {
  cols <- grep(paste0("^", metric), names(full_data_customer), value = TRUE)
  full_data_customer[cols] <- lapply(full_data_customer[cols], as.numeric)
  full_data_customer[[paste0("AVG_", metric, "_M")]] <-
    rowMeans(full_data_customer[, cols], na.rm = TRUE)
}

# Number of transactions per year (sum the monthly count columns)
for (metric in c("ORD_CA", "ORD_GAL", "DLV_CA", "DLV_GAL", "RET_CA", "RET_GAL")) {
  for (yr in c("2023", "2024")) {
    cols <- grep(paste0("^TRANS_", metric, "_", yr), names(full_data_customer), value = TRUE)
    full_data_customer[[paste0("NUM_TRANS_", metric, "_", substr(yr, 3, 4))]] <-
      rowSums(full_data_customer[, cols, drop = FALSE], na.rm = TRUE)
  }
}

# Sum of quantities per year (sum the monthly quantity columns)
for (metric in c("ORD_CA", "ORD_GAL", "DLV_CA", "DLV_GAL", "RET_CA", "RET_GAL")) {
  for (yr in c("2023", "2024")) {
    cols <- grep(paste0("^QTD_", metric, "_", yr), names(full_data_customer), value = TRUE)
    full_data_customer[[paste0("QTD_", metric, "_", yr)]] <-
      rowSums(full_data_customer[, cols, drop = FALSE], na.rm = TRUE)
  }
}

# Activity flags per year
full_data_customer$ACTIVE_23 <- ifelse((full_data_customer$QTD_DLV_CA_2023 + full_data_customer$QTD_DLV_GAL_2023) > 0, 1, 0)
full_data_customer$ACTIVE_24 <- ifelse((full_data_customer$QTD_DLV_CA_2024 + full_data_customer$QTD_DLV_GAL_2024) > 0, 1, 0)

# Display the first few rows of the combined dataset
# head(full_data_customer)
3.4 Estimated Delivery Costs
The delivery costs are estimates: the company provided them as the median cost within volume ranges, broken down by type of COLD_DRINK_CHANNEL.
With this mapping, a cost was computed for every transaction. Percentage variations for the number of operations, demand, and cost have not yet been generated, because not all customers have history for both 2023 and 2024, which prevents such calculations. Methods to quantify the growth of each customer will be identified later.
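As an illustration of a range-based cost lookup of the kind described, the sketch below maps volumes to a per-unit cost by band; the bands and costs are hypothetical, not the values in delivery_cost_data.xlsx:

```r
# Hypothetical volume bands with a median cost per unit for each band
cost_bands <- data.frame(min_vol = c(0, 150, 400),
                         max_vol = c(149, 399, Inf),
                         cost_per_unit = c(2.50, 1.80, 1.20))

# findInterval() returns the band whose lower bound each volume falls into
lookup_cost <- function(volume) {
  idx <- findInterval(volume, cost_bands$min_vol)
  cost_bands$cost_per_unit[idx]
}

lookup_cost(c(100, 200, 1000))  # 2.5 1.8 1.2
```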
3.5 Target Variables: Initial Assumptions
As initially explained, we will establish classifications related to the target variables to create an initial reference point.
3.5.1 - Demand Threshold and Fleet Assignment
The average annual consumption per customer will be calculated and customers will be classified based on whether they exceed the threshold of 400 units (cases plus gallons).
Code
# Calculating the average annual consumption
full_data_customer$AVG_ANNUAL_CONSUMP <- round(
  (full_data_customer$QTD_DLV_CA_GAL_2023 + full_data_customer$QTD_DLV_CA_GAL_2024) / 2, 1)

# Creating the THRESHOLD_REACH variable
full_data_customer$THRESHOLD_REACH <- ifelse(full_data_customer$AVG_ANNUAL_CONSUMP < 400, 0, 1)

# Summarize data by THRESHOLD_REACH
data_threshold_reach <- full_data_customer %>%
  group_by(THRESHOLD_REACH) %>%
  summarise(CustomerCount = n(), .groups = "drop") %>%
  mutate(Percentage = round(CustomerCount / sum(CustomerCount) * 100, 1))

# Display the table
kable(data_threshold_reach,
      col.names = c("Threshold Reach", "Customer Count", "Percentage (%)"),
      format = "simple")
| Threshold Reach | Customer Count | Percentage (%) |
|---|---|---|
| 0 | 23,081 | 76.1 |
| 1 | 7,239 | 23.9 |
In total, 23,081 customers (76.1%) did not reach the average annual threshold of 400 units (cases plus gallons), while the remaining 7,239 (23.9%) did.
Customers who reach or exceed 400 units annually will be assigned to Red Trucks, while the remaining customers will be allocated to White Trucks.
Code
# Create the FLEET_TYPE column based on THRESHOLD_REACH only
full_data_customer$FLEET_TYPE <- ifelse(full_data_customer$THRESHOLD_REACH == 1,
                                        "RED TRUCK", "WHITE TRUCK")

# Number of customers by FLEET_TYPE and LOCAL_FOUNT_ONLY
summary_fleet_type <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(total_customers = n(), .groups = "drop")

# Percentage of customers within each LOCAL_FOUNT_ONLY group
summary_fleet_type <- summary_fleet_type %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(pct_customers = total_customers / sum(total_customers) * 100)

# Transform data into long format for percentages
summary_fleet_type_long <- summary_fleet_type %>%
  pivot_longer(cols = starts_with("pct_"),
               names_to = "metric", values_to = "percentage") %>%
  mutate(metric = factor(metric, levels = c("pct_customers"),
                         labels = c("Percentage of Customers")))

# Ensure LOCAL_FOUNT_ONLY and FLEET_TYPE are factors
summary_fleet_type_long$LOCAL_FOUNT_ONLY <- factor(summary_fleet_type_long$LOCAL_FOUNT_ONLY, levels = c("0", "1"))
summary_fleet_type_long$FLEET_TYPE <- factor(summary_fleet_type_long$FLEET_TYPE, levels = c("RED TRUCK", "WHITE TRUCK"))

# Plot percentages with FLEET_TYPE as colors and LOCAL_FOUNT_ONLY as groups
ggplot(summary_fleet_type_long,
       aes(x = LOCAL_FOUNT_ONLY, y = percentage, fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(percentage, suffix = "%")),
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  labs(title = "Percentage of Customers by Fleet Type and Local Fountain Only") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.title = element_blank(),
        legend.position = "right",
        legend.box = "vertical",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.text = element_text(size = 10, face = "bold"),
        strip.background = element_blank(),
        axis.text.x = element_text(size = 10),
        panel.spacing = unit(1, "lines"),
        strip.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  guides(fill = guide_legend(title = "Fleet Type"))
Code
# Number of customers by FLEET_TYPE and LOCAL_FOUNT_ONLY
summary_fleet_type_count <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(total_customers = n(), .groups = "drop")

# Display the summary with the count of customers by fleet type and LOCAL_FOUNT_ONLY
# summary_fleet_type_count
According to these criteria, 13% of Local Fountain Only customers would be assigned to RED TRUCK. Among the other customers, 24% would receive deliveries via RED TRUCK.
Code
# Group by LOCAL_FOUNT_ONLY and FLEET_TYPE, then calculate the total delivered volume
summary_fleet_type_total <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(total_QTD_DLV = sum(QTD_DLV_TOTAL, na.rm = TRUE), .groups = "drop")

# Ensure LOCAL_FOUNT_ONLY and FLEET_TYPE are factors for plotting
summary_fleet_type_total$LOCAL_FOUNT_ONLY <- factor(summary_fleet_type_total$LOCAL_FOUNT_ONLY, levels = c("0", "1"))
summary_fleet_type_total$FLEET_TYPE <- factor(summary_fleet_type_total$FLEET_TYPE, levels = c("RED TRUCK", "WHITE TRUCK"))

# Plot the total delivered volume (QTD_DLV_TOTAL) by FLEET_TYPE and LOCAL_FOUNT_ONLY
ggplot(summary_fleet_type_total,
       aes(x = LOCAL_FOUNT_ONLY, y = total_QTD_DLV, fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(total_QTD_DLV)),
            position = position_dodge(width = 0.8), vjust = 0.0, size = 3.5) +
  labs(title = "Total Delivered Volume by Fleet Type and Local Fountain Only (23 & 24)") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.title = element_blank(),
        legend.position = "right",
        legend.box = "vertical",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.text = element_text(size = 10, face = "bold"),
        strip.background = element_blank(),
        axis.text.x = element_text(size = 10),
        panel.spacing = unit(1, "lines"),
        strip.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  guides(fill = guide_legend(title = "Fleet Type"))
Code
# Total delivered volume (QTD_DLV_TOTAL) by FLEET_TYPE and LOCAL_FOUNT_ONLY
summary_fleet_type_count <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(total_QTD_DLV = sum(QTD_DLV_TOTAL, na.rm = TRUE), .groups = "drop")

# Display the summary with the total delivered volume by FLEET_TYPE and LOCAL_FOUNT_ONLY
# summary_fleet_type_count
The vast majority of the volume would be delivered by RED TRUCK (85% of the total), with the remaining portion delivered by WHITE TRUCK (15%).
Code
# Percentage of delivered volume (QTD_DLV_TOTAL) by FLEET_TYPE within each LOCAL_FOUNT_ONLY group
summary_fleet_type_pct <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(total_QTD_DLV = sum(QTD_DLV_TOTAL, na.rm = TRUE), .groups = "drop") %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(pct_QTD_DLV = total_QTD_DLV / sum(total_QTD_DLV) * 100)

# Ensure LOCAL_FOUNT_ONLY and FLEET_TYPE are factors for plotting
summary_fleet_type_pct$LOCAL_FOUNT_ONLY <- factor(summary_fleet_type_pct$LOCAL_FOUNT_ONLY, levels = c("0", "1"))
summary_fleet_type_pct$FLEET_TYPE <- factor(summary_fleet_type_pct$FLEET_TYPE, levels = c("RED TRUCK", "WHITE TRUCK"))

# Plot the percentage of delivered volume by FLEET_TYPE and LOCAL_FOUNT_ONLY
ggplot(summary_fleet_type_pct,
       aes(x = LOCAL_FOUNT_ONLY, y = pct_QTD_DLV, fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = paste0(round(pct_QTD_DLV, 1), "%")),
            position = position_dodge(width = 0.8), vjust = 0.0, size = 3.5) +
  labs(title = "Percentage of Delivered Volume by Fleet Type and Local Fountain Only (23 & 24)") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.title = element_blank(),
        legend.position = "right",
        legend.box = "vertical",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.text = element_text(size = 10, face = "bold"),
        strip.background = element_blank(),
        axis.text.x = element_text(size = 10),
        panel.spacing = unit(1, "lines"),
        strip.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  guides(fill = guide_legend(title = "Fleet Type"))
Code
# Group and calculate the percentage of delivered volume (QTD_DLV_TOTAL) by FLEET_TYPE and LOCAL_FOUNT_ONLY
summary_fleet_type_count_pct <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(
    total_QTD_DLV = sum(QTD_DLV_TOTAL, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(
    pct_QTD_DLV = total_QTD_DLV / sum(total_QTD_DLV) * 100 # Percentage of delivered volume within each group
  )

# Display the summary with the percentage of delivered volume by FLEET_TYPE and LOCAL_FOUNT_ONLY
#summary_fleet_type_count_pct
Considering the customer groups independently, nearly 59% of the volume delivered to local partners purchasing fountain only would be transported by RED TRUCKS, while for the remaining customers, almost 85% of the volume would be delivered by RED TRUCKS.
3.6 - Questions and Considerations on Missing Data and Unknown Classes
After the first portion of the EDA, there is a better understanding of the data, but not all questions have been answered. These will continue to be explored in the next section, though some may remain unresolved due to the nature of the questions. The following questions have been identified:
Based on the available data, what would be a robust statistical approach to calculate the customer growth rate? A simplistic approach was initially used, relying on the average as a reference to visualize the data. However, a more validated method could certainly be applied.
What is the average load capacity of a Red Truck compared to a White Truck?
Adding an ID for individual account executives to the customer profile data could be valuable. Is the quality of the account executive a confounding variable when looking at high growth rate customers?
Does the company set a delivery deadline in days or hours?
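As a possible direction for the first question, one option is a symmetric log growth rate per customer, which dampens outliers relative to a raw percentage change. The sketch below is illustrative only: it assumes the 2023/2024 volume columns used elsewhere in this analysis, and the small data frame stands in for `full_data_customer`.

```r
# A minimal sketch (assumed column names; toy data for illustration)
library(dplyr)

toy <- data.frame(
  CUSTOMER_NUMBER  = c(1, 2, 3),
  QTD_DLV_CA_2023  = c(100, 50, 0),
  QTD_DLV_GAL_2023 = c(20, 0, 10),
  QTD_DLV_CA_2024  = c(150, 0, 0),
  QTD_DLV_GAL_2024 = c(30, 0, 20)
)

growth <- toy %>%
  mutate(
    vol_23 = QTD_DLV_CA_2023 + QTD_DLV_GAL_2023,
    vol_24 = QTD_DLV_CA_2024 + QTD_DLV_GAL_2024,
    # Log growth is symmetric around zero and less sensitive to large swings;
    # it is undefined when either year has zero volume, which flags churned
    # or brand-new customers for separate treatment
    log_growth = ifelse(vol_23 > 0 & vol_24 > 0, log(vol_24 / vol_23), NA_real_)
  )
```

With only two years of data, any rate is a two-point estimate; the NA cases (zero volume in either year) are exactly the customers a robust approach would need to handle explicitly.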
4. Exploratory Data Analysis (EDA) - Part II
With the initial analysis complete and the datasets built, we will now explore the customers in more detail, keeping the stated objectives in focus.
4.1 Customers overview
Geographical Distribution of Customers
Although the location data is not real (it has been anonymized), its distribution is shown below.
Code
# Load the U.S. map
us_map <- map_data("state")

# Create the plot
ggplot() +
  geom_polygon(data = us_map, aes(x = long, y = lat, group = group),
               fill = "lightblue", color = "white") +
  geom_point(data = full_data_customer, aes(x = LONGITUDE, y = LATITUDE),
             color = "#B33951", alpha = 0.6, size = 0.5) +
  coord_fixed(1.3) +
  theme_minimal() +
  labs(title = "Customers Geographical Distribution") +
  theme(axis.text.x = element_blank(), axis.text.y = element_blank(),
        axis.title.x = element_blank(), axis.title.y = element_blank(),
        panel.grid.major = element_blank(), panel.grid.minor = element_blank())
Code
# Calculate the number of unique customers
unique_customers <- length(unique(full_data_customer$CUSTOMER_NUMBER))

# Frequency of each value in the 'CHAIN_MEMBER' column
chain_member_count <- table(full_data_customer$CHAIN_MEMBER)

# Calculate the number of unique primary group numbers
unique_primary_groups <- length(unique(full_data_customer$PRIMARY_GROUP_NUMBER))

# Sum the costs for cases and gallons in 2023 and 2024
cost_dlv <- sum(full_data_customer$COST_CA_23, full_data_customer$COST_CA_24,
                full_data_customer$COST_GAL_23, full_data_customer$COST_GAL_24, na.rm = TRUE)

# Sum the number of transactions for cases and gallons in 2023 and 2024
trans_dlv <- sum(full_data_customer$NUM_TRANS_DLV_CA_23, full_data_customer$NUM_TRANS_DLV_CA_24,
                 full_data_customer$NUM_TRANS_DLV_GAL_23, full_data_customer$NUM_TRANS_DLV_GAL_24, na.rm = TRUE)

# Sum the quantity of cases and gallons delivered in 2023 and 2024
qtd_dlv <- sum(full_data_customer$QTD_DLV_CA_2023, full_data_customer$QTD_DLV_GAL_2023,
               full_data_customer$QTD_DLV_CA_2024, full_data_customer$QTD_DLV_GAL_2024, na.rm = TRUE)

# Average cost per delivery transaction
avg_cost_per_transaction <- cost_dlv / trans_dlv

# Average cost per case or gallon delivered
avg_cost_per_quantity <- cost_dlv / qtd_dlv

# Display results
unique_customers          # Number of unique customers
chain_member_count        # Frequency count of each CHAIN_MEMBER value
unique_primary_groups     # Number of unique primary group numbers
cost_dlv                  # Total cost for cases and gallons in 2023 and 2024
trans_dlv                 # Total number of transactions in 2023 and 2024
avg_cost_per_transaction  # Average cost per delivery transaction
avg_cost_per_quantity     # Average cost per case or gallon delivered
After removing customers who did not make any transactions in 2023 and 2024, there are 30,320 unique customers who made transactions during these years.
Of these, 18,061 are unique outlets, while 12,259 belong to 1,020 different chains that have transacted with the company.
All of their delivery transactions represented a total cost of approximately $67,907,394, with an average of $55.80 per delivery transaction and $1.88 per case or gallon delivered.
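As a quick sanity check on these averages, the implied transaction and unit counts can be back-calculated from the quoted totals (base R; figures rounded as quoted in the text):

```r
# Back-calculate implied counts from the quoted totals and averages
total_cost <- 67907394                      # total delivery cost, 2023-2024
implied_transactions <- total_cost / 55.8   # implied number of delivery transactions
implied_units <- total_cost / 1.88          # implied cases/gallons delivered

round(implied_transactions / 1e6, 2)        # roughly 1.2 million transactions
round(implied_units / 1e6, 1)               # roughly 36 million cases/gallons
```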
4.2 Local Market Partners (Fountain Only)
Code
# Clean
clean_data <- full_data %>%
  filter(!is.na(LOCAL_FOUNT_ONLY)) %>% # Keep rows where LOCAL_FOUNT_ONLY is not NA
  mutate(LOCAL_FOUNT_ONLY = factor(LOCAL_FOUNT_ONLY, levels = c("0", "1"))) # Convert to factor

# Aggregate data by LOCAL_FOUNT_ONLY
summary_data <- clean_data %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    customers = n_distinct(CUSTOMER_NUMBER),
    transactions = n(),
    qtd_cas = sum(DELIVERED_CASES, na.rm = TRUE),
    qtd_gal = sum(DELIVERED_GALLONS, na.rm = TRUE),
    total_qtd = qtd_cas + qtd_gal,
    .groups = "drop"
  ) %>%
  mutate(
    pct_cust = customers / sum(customers) * 100,
    pct_trans = transactions / sum(transactions) * 100,
    pct_qtd = total_qtd / sum(total_qtd) * 100,
    pct_gal = qtd_gal / sum(qtd_gal) * 100
  ) %>%
  rename(LFO = LOCAL_FOUNT_ONLY)

# Transform data to long format
summary_data_long <- summary_data %>%
  pivot_longer(cols = starts_with("pct_"), names_to = "metric", values_to = "percentage") %>%
  mutate(metric = factor(metric,
                         levels = c("pct_cust", "pct_trans", "pct_gal", "pct_qtd"),
                         labels = c("Customers", "Delivery Transactions", "Gallons", "Total (Cases+Gallons)")))

# Convert LFO to factor
summary_data_long$LFO <- factor(summary_data_long$LFO, levels = c("0", "1"))

# Create the plot
ggplot(summary_data_long, aes(x = LFO, y = percentage, fill = LFO)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")),
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  facet_wrap(~ metric, scales = "fixed", ncol = 2) +
  labs(title = "Percentage Breakdown by Consumption Pattern") +
  scale_fill_manual(values = c("0" = "#8ED081", "1" = "#A7ADC6")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank(),
        legend.position = "none",
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        strip.text = element_text(size = 10, face = "bold"), strip.background = element_blank(),
        axis.text.x = element_text(size = 10), panel.spacing = unit(1, "lines"),
        strip.text.y = element_blank(), axis.ticks.y = element_blank()) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only"))
Local market partners who purchase only fountain drinks (Gallons) account for 4.5% of the customers and represent 6% of the company’s gallons demand. Their delivery transaction volume is low, contributing only 3%, and the volume delivered accounts for just 1.6% of the total negotiated volume.
Code
# Group and calculate sums and percentages
summary_full_data <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    total_cost_gal = sum(COST_GAL_23, na.rm = TRUE) + sum(COST_GAL_24, na.rm = TRUE),
    total_cost_ca = sum(COST_CA_23, na.rm = TRUE) + sum(COST_CA_24, na.rm = TRUE),
    total_cost_all = total_cost_gal + total_cost_ca,
    .groups = "drop"
  ) %>%
  mutate(
    pct_cost_gal = total_cost_gal / total_cost_all * 100,
    pct_cost_ca = total_cost_ca / total_cost_all * 100,
    pct_total = total_cost_all / sum(total_cost_all) * 100
  ) %>%
  rename(LFO = LOCAL_FOUNT_ONLY)

# Transform data into long format for totals
summary_full_data_long <- summary_full_data %>%
  pivot_longer(cols = starts_with("total_"), names_to = "metric", values_to = "value") %>%
  mutate(metric = factor(metric,
                         levels = c("total_cost_gal", "total_cost_ca", "total_cost_all"),
                         labels = c("Cost Gallons (23 & 24)", "Cost Cases (23 & 24)", "Total Cost")))

# For percentages
summary_full_data_pct_long <- summary_full_data %>%
  pivot_longer(cols = starts_with("pct_"), names_to = "metric", values_to = "percentage") %>%
  mutate(metric = factor(metric,
                         levels = c("pct_cost_gal", "pct_cost_ca", "pct_total"),
                         labels = c("Percentage Cost Gallons (23 & 24)", "Percentage Cost Cases (23 & 24)", "Percentage Total Cost")))

# Ensure LFO is a factor
summary_full_data_long$LFO <- factor(summary_full_data_long$LFO, levels = c("0", "1"))

# Plot for total costs
ggplot(summary_full_data_long, aes(x = LFO, y = value, fill = LFO)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(value, prefix = "$")),
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  facet_wrap(~ metric, scales = "fixed", nrow = 1) +
  labs(title = "Total Costs by Consumption Pattern") +
  scale_fill_manual(values = c("0" = "#8ED081", "1" = "#A7ADC6")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank(),
        legend.position = "none",
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        strip.text = element_text(size = 10, face = "bold"), strip.background = element_blank(),
        axis.text.x = element_text(size = 10), panel.spacing = unit(1, "lines"),
        strip.text.y = element_blank(), axis.ticks.y = element_blank()) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only"))
In 2023 and 2024, the total delivery cost was $67.9 million, of which only $1.2 million was allocated to local market partners.
Code
# Group and calculate sums and percentages by LFO
summary_full_data <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    total_cost_gal = sum(COST_GAL_23, na.rm = TRUE) + sum(COST_GAL_24, na.rm = TRUE),
    total_cost_ca = sum(COST_CA_23, na.rm = TRUE) + sum(COST_CA_24, na.rm = TRUE),
    total_cost_all = total_cost_gal + total_cost_ca,
    .groups = "drop"
  ) %>%
  mutate(
    pct_cost_gal = total_cost_gal / sum(total_cost_gal) * 100,
    pct_cost_ca = total_cost_ca / sum(total_cost_ca) * 100,
    pct_total = total_cost_all / sum(total_cost_all) * 100
  ) %>%
  rename(LFO = LOCAL_FOUNT_ONLY)

# Transform data into long format for percentages
summary_full_data_pct_long <- summary_full_data %>%
  pivot_longer(cols = starts_with("pct_"), names_to = "metric", values_to = "percentage") %>%
  mutate(metric = factor(metric,
                         levels = c("pct_cost_gal", "pct_cost_ca", "pct_total"),
                         labels = c("% Cost - Gallons", "% Cost - Cases", "% Total Cost")))

# Ensure LFO is a factor
summary_full_data_pct_long$LFO <- factor(summary_full_data_pct_long$LFO, levels = c("0", "1"))

# Plot for percentages
ggplot(summary_full_data_pct_long, aes(x = LFO, y = percentage, fill = LFO)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(percentage, suffix = "%")),
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  facet_wrap(~ metric, scales = "fixed", nrow = 1) +
  labs(title = "Percentage Costs by Consumption Pattern") +
  scale_fill_manual(values = c("0" = "#8ED081", "1" = "#A7ADC6")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_blank(), axis.title.x = element_blank(), axis.title.y = element_blank(),
        legend.position = "none",
        panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
        strip.text = element_text(size = 10, face = "bold"), strip.background = element_blank(),
        axis.text.x = element_text(size = 10), panel.spacing = unit(1, "lines"),
        strip.text.y = element_blank(), axis.ticks.y = element_blank()) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only"))
Thus, in 2023 and 2024, the local partners who consume only fountain accounted for 1.8% of the total delivery costs. When we look at their share specifically in gallon deliveries, their participation rises to 7.3%.
4.3 Customers History
Below is the chart showing the density of customers in relation to the start of their partnership and their first delivery.
Code
# Gather ON_BOARDING_DATE and FIRST_DELIVERY_DATE, filtering out 2025 data
profile_data_long <- profile_data %>%
  filter(!is.na(ON_BOARDING_DATE) & !is.na(FIRST_DELIVERY_DATE)) %>%
  # Filter out 2025 data to avoid showing deliveries in that year
  filter(format(FIRST_DELIVERY_DATE, "%Y") != "2025") %>%
  pivot_longer(cols = c(ON_BOARDING_DATE, FIRST_DELIVERY_DATE), names_to = "Event", values_to = "Date") %>%
  # Set factor levels so ON_BOARDING_DATE appears first in the plot
  mutate(Event = factor(Event, levels = c("ON_BOARDING_DATE", "FIRST_DELIVERY_DATE")))

# Create density plots with facet_wrap
ggplot(profile_data_long, aes(x = Date, fill = Event, color = Event)) +
  geom_density(alpha = 0.5) + # Adjust transparency for better visualization
  facet_wrap(~ Event, scales = "free", ncol = 2) + # One facet per variable
  labs(title = "Density Plots of Onboarding and First Delivery Dates", x = "Date", y = "Density") +
  scale_fill_manual(values = c("steelblue", "orange")) + # Custom colors (first delivery = orange)
  scale_color_manual(values = c("steelblue", "orange")) +
  scale_y_continuous(labels = scales::label_number()) + # No scientific notation on the y-axis
  theme_minimal() +
  theme(legend.position = "none") # Remove the legend for a cleaner plot
The vast majority of customers started to appear after 2010. The figures for the first deliveries show that, since 2016, at least 2,000 customers have received their first delivery each year. There were peaks in 2016 and 2017. In 2024, there was a decrease in the number of customers receiving their first delivery compared to 2023.
Code
# Reshape data: gather ON_BOARDING_DATE and FIRST_DELIVERY_DATE
profile_data_long <- profile_data %>%
  filter(!is.na(ON_BOARDING_DATE) & !is.na(FIRST_DELIVERY_DATE)) %>%
  pivot_longer(cols = c(ON_BOARDING_DATE, FIRST_DELIVERY_DATE), names_to = "Event", values_to = "Date")

# Set factor levels so ON_BOARDING_DATE appears first in the plot
profile_data_long$Event <- factor(profile_data_long$Event, levels = c("ON_BOARDING_DATE", "FIRST_DELIVERY_DATE"))

# Ensure Date is in Date format
profile_data_long$Date <- as.Date(profile_data_long$Date)

# Create histograms with yearly aggregation
ggplot(profile_data_long, aes(x = Date, fill = Event)) +
  geom_histogram(binwidth = 365, color = "black", alpha = 0.5, position = "identity") +
  facet_wrap(~ Event, scales = "free_x", ncol = 2) + # Free scaling for the x-axis
  labs(title = "Distribution of Customer Onboarding and First Delivery Dates", x = "Date", y = "Count") +
  scale_fill_manual(values = c("steelblue", "orange")) + # Custom colors for events
  scale_x_date(labels = scales::date_format("%Y"), expand = c(0.01, 0.01)) + # Show only the year on the x-axis
  scale_y_continuous(labels = scales::label_number()) + # No scientific notation on the y-axis
  theme_minimal() +
  theme(legend.position = "none", # Remove legend
        axis.text.x = element_text(hjust = -0.1),
        axis.ticks.x = element_blank(),
        panel.grid.major.x = element_blank(), # Remove vertical gridlines
        panel.grid.minor.x = element_blank()) # Remove minor vertical gridlines
4.4 Order Types
The way orders are placed and by whom is important for understanding customer growth potential. Most customer profiles are associated with sales representatives (65.7%). Other methods follow with 17.6%, and MyCoke 360 accounts for nearly 8%, despite only being launched in Summer 2024 to replace MyCoke Legacy.
However, when analyzing actual transactions from 2023 and 2024, the distribution of order types differs significantly from the customer profiles. For example, sales representatives were responsible for only 27.5% of the orders, not 65.7% as listed in the profiles. Therefore, the analysis will be based on actual transactions rather than profile data.
Below are the percentages of cases ordered by order type across all transactions placed in 2023 and 2024.
Code
# Define the custom color palette (neutral colors from RColorBrewer's "Set3")
custom_palette_type <- brewer.pal(6, "Set3") # A 6-color palette from Set3

# Summarize data by ORDER_TYPE, summing ORDERED_CASES
data_summary_order_type <- full_data %>%
  group_by(ORDER_TYPE) %>%
  summarise(OrderedCasesSum = sum(ORDERED_CASES, na.rm = TRUE), .groups = "drop") %>%
  mutate(Percentage = round(OrderedCasesSum / sum(OrderedCasesSum) * 100, 1),
         Percentage = ifelse(Percentage < 0.15, NA, Percentage)) # Hide values below 0.15%

# Create the horizontal bar chart with percentages (no aggregation by LOCAL_FOUNT_ONLY)
ggplot(data_summary_order_type, aes(x = OrderedCasesSum, y = reorder(ORDER_TYPE, OrderedCasesSum), fill = ORDER_TYPE)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +
  geom_text(aes(label = ifelse(!is.na(Percentage), paste(Percentage, "%"), "")), # Display text only if Percentage is not NA
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Percentage of Ordered Case Volumes by Order Type (23 & 24)", x = NULL, y = NULL) +
  scale_x_continuous(labels = NULL, expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = custom_palette_type) + # Apply the custom color palette
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),
        legend.position = "none", # The legend is not needed here
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
In the 2023 and 2024 ordered cases transactions, it’s clear that the majority of operations were carried out through digital channels, specifically MyCoke Legacy and MyCoke 360, accounting for 35.5%. This was followed by sales representatives with 25.4%, and call centers with 15.5%. MyCoke 360, which was recently launched, makes up 7.1% of the transactions.
Code
# Summarize data by ORDER_TYPE and LOCAL_FOUNT_ONLY, summing ORDERED_GALLONS
data_summary_order_type <- full_data %>%
  group_by(ORDER_TYPE, LOCAL_FOUNT_ONLY) %>%
  summarise(OrderedGallonsSum = sum(ORDERED_GALLONS, na.rm = TRUE), .groups = "drop") %>%
  mutate(Percentage = round(OrderedGallonsSum / sum(OrderedGallonsSum) * 100, 1),
         Percentage = ifelse(Percentage < 0.0, NA, Percentage)) # Hide negative values (none expected)

# Create the horizontal bar chart with percentages, faceted by LOCAL_FOUNT_ONLY
ggplot(data_summary_order_type, aes(x = OrderedGallonsSum, y = reorder(ORDER_TYPE, OrderedGallonsSum), fill = ORDER_TYPE)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +
  geom_text(aes(label = ifelse(!is.na(Percentage), paste(Percentage, "%"), "")), # Display text only if Percentage is not NA
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "% of Ordered Gallons by Order Type and Customer Type (2023 & 2024)", x = NULL, y = NULL) +
  scale_x_continuous(labels = NULL, expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = custom_palette_type) + # Custom color palette for ORDER_TYPE
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),
        legend.position = "none", # Hide the legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.text = element_text(face = "bold", size = 10),
        strip.background = element_blank()) +
  facet_wrap(~ LOCAL_FOUNT_ONLY, scales = "free_y", ncol = 2,
             labeller = labeller(LOCAL_FOUNT_ONLY = c("0" = "Others", "1" = "Local Fountain Only"))) # Facet labels
For gallon orders, only a very small fraction (less than 6%) is represented by Local Market Partners that order Fountain Only. For these customers, the majority of their orders are placed via the call center (2.4%), followed by digital channels (2.2%), and finally sales reps (1.3%).
For the remaining customers, digital channels represent 34.6% (MyCoke360 + Legacy), sales reps 32.5%, and call centers 24.5%.
It can be said that digital channels are the most used, accounting for approximately 35% of the total volume of cases and gallons for all customers. Sales reps have a smaller proportional share for case orders but carry more weight for gallon orders.
Code
# Summarize data by ORDER_TYPE, summing DELIVERED_CASES and DELIVERED_GALLONS
data_summary_order_type <- full_data %>%
  group_by(ORDER_TYPE) %>%
  summarise(DeliveredCasesSum = sum(DELIVERED_CASES, na.rm = TRUE),
            DeliveredGallonsSum = sum(DELIVERED_GALLONS, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(TotalVolume = DeliveredCasesSum + DeliveredGallonsSum,
         Percentage = round(TotalVolume / sum(TotalVolume) * 100, 1))

# Create horizontal bar chart with both absolute volume and percentage
ggplot(data_summary_order_type, aes(x = TotalVolume, y = reorder(ORDER_TYPE, TotalVolume), fill = ORDER_TYPE)) +
  geom_bar(stat = "identity", alpha = 0.5) +
  geom_text(aes(label = paste(scales::comma(TotalVolume, accuracy = 1), paste0("(", Percentage, "%)"))),
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  scale_x_continuous(labels = scales::comma,
                     breaks = seq(0, max(data_summary_order_type$TotalVolume), by = 5000000),
                     expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = custom_palette_type) +
  labs(title = "Total Delivered Cases and Gallons by Order Type (23 & 24)",
       x = "Volume (units)", y = NULL) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_text(size = 10, face = "plain"),
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.major.x = element_line(color = "lightgray", size = 0.5),
        panel.grid.minor = element_blank())
In line with the previous points, digital channels account for nearly 36% of the total volume delivered in 2023 and 2024, followed by sales reps at 27.5% and call centers at 18.5%.
Code
# Summarize data by ORDER_TYPE, summing DELIVERED_CASES and DELIVERED_GALLONS
# NOTE: this chart is intended to show delivered cost by order type (see the
# commentary below); if per-transaction cost columns are available in full_data,
# they should be summed here instead of delivered volumes.
data_summary_order_type <- full_data %>%
  group_by(ORDER_TYPE) %>%
  summarise(DeliveredCasesSum = sum(DELIVERED_CASES, na.rm = TRUE),
            DeliveredGallonsSum = sum(DELIVERED_GALLONS, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(TotalVolume = DeliveredCasesSum + DeliveredGallonsSum,
         Percentage = round(TotalVolume / sum(TotalVolume) * 100, 1))

# Create horizontal bar chart with absolute volume and percentage
ggplot(data_summary_order_type, aes(x = TotalVolume, y = reorder(ORDER_TYPE, TotalVolume), fill = ORDER_TYPE)) +
  geom_bar(stat = "identity", alpha = 0.5) +
  geom_text(aes(label = paste(scales::comma(TotalVolume, accuracy = 1), paste0("(", Percentage, "%)"))),
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  scale_x_continuous(labels = scales::comma,
                     breaks = seq(0, max(data_summary_order_type$TotalVolume), by = 5000000),
                     expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = custom_palette_type) +
  labs(title = "Total Delivered Cases and Gallons by Order Type (23 & 24)",
       x = "Cost $", y = NULL) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_text(size = 10, face = "plain"),
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.major.x = element_line(color = "lightgray", size = 0.5),
        panel.grid.minor = element_blank())
Digital channels account for the majority of the costs, representing 40% of the total delivered cost. Notably, call center costs are slightly higher than sales rep costs, suggesting that the call center's smaller volumes inflate its cost per unit delivered.
Code
# Summarize by ORDER_TYPE and FLEET_TYPE using delivered volume
data_summary_fleet_by_order <- full_data %>%
  filter(!is.na(FLEET_TYPE), !is.na(ORDER_TYPE)) %>%
  group_by(ORDER_TYPE, FLEET_TYPE) %>%
  summarise(TotalDelivered = sum(DELIVERED_CASES + DELIVERED_GALLONS, na.rm = TRUE), .groups = "drop") %>%
  group_by(ORDER_TYPE) %>%
  mutate(Percentage = round(TotalDelivered / sum(TotalDelivered) * 100, 0))

# Order ORDER_TYPE by total delivered volume
order_levels <- data_summary_fleet_by_order %>%
  group_by(ORDER_TYPE) %>%
  summarise(Total = sum(TotalDelivered), .groups = "drop") %>%
  arrange(Total) %>%
  pull(ORDER_TYPE)

# Reorder as factor
data_summary_fleet_by_order$ORDER_TYPE <- factor(data_summary_fleet_by_order$ORDER_TYPE, levels = order_levels)

# Plot
ggplot(data_summary_fleet_by_order, aes(x = TotalDelivered, y = ORDER_TYPE, fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.6) +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5), hjust = 0, color = "black", size = 3.2) +
  labs(title = "400 Gallons Threshold x Delivered Volume by Order Type",
       x = "Volume (units)", y = NULL, fill = "Fleet Type") +
  scale_x_continuous(labels = function(x) paste0(x / 1e6, "M"),
                     breaks = c(2500000, 5000000, 7500000, 10000000),
                     expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_text(size = 10, face = "plain"),
        legend.position = "right", legend.direction = "vertical",
        panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_line(color = "lightgray", size = 0.5),
        panel.grid.minor = element_blank())
Sales Rep had the highest internal percentage of customers (62%) who would be served by red trucks if the 400-gallon threshold were applied. On the other hand, Call Center showed the highest percentage of customers who would be served by white trucks.
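The threshold rule behind these charts can be sketched as a simple classification on annual delivered gallons. This is an illustrative sketch only: the function name and the averaging of the two years are assumptions, not the client's exact business rule.

```r
# Hypothetical helper: assign a fleet type from delivered gallons in 2023 and
# 2024, using the 400-gallon annual threshold discussed above (averaging the
# two years is an assumption made for this sketch)
assign_fleet <- function(gal_23, gal_24, threshold = 400) {
  avg_annual <- (gal_23 + gal_24) / 2
  ifelse(avg_annual >= threshold, "RED TRUCK", "WHITE TRUCK")
}

assign_fleet(c(900, 100), c(700, 300)) # averages 800 and 200 gallons/year
```

Keeping the rule in one vectorized function makes it easy to re-run the whole allocation under a different threshold when the sensitivity of the split is tested later.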
4.5 Channel Types
More than 50% of transactions were made through the DINING channel, followed by GOODS (16.6%), EVENTS (9.2%), and BULK TRADE (8.4%). The remaining channels each represent less than 5% of the total.
Transactions for Local Partners Fountain Only are almost entirely concentrated in DINING, accounting for 2.7% of all transactions, compared with 47.8% for other customers in that channel.
Code
# Calculate the frequency of each COLD_DRINK_CHANNEL
data_summary_cold_drink_channel <- full_data %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1))

# Create a horizontal bar chart with percentages for COLD_DRINK_CHANNEL
ggplot(data_summary_cold_drink_channel, aes(x = Count, y = reorder(COLD_DRINK_CHANNEL, Count), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +
  geom_text(aes(label = ifelse(!is.na(Percentage), paste(Percentage, "%"), "")), # Display text only if Percentage is not NA
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Percentage of Transactions by Cold Drink Channel", x = NULL, y = NULL) +
  scale_x_continuous(labels = NULL, expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) + # Custom color palette
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),
        legend.position = "none", # Remove the legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
Code
# Summarize data by COLD_DRINK_CHANNEL, summing the quantities of gallons and cases
data_summary <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE) +
              sum(QTD_DLV_CA_2023, na.rm = TRUE) + sum(QTD_DLV_CA_2024, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(Percentage = round(Total_Volume / sum(Total_Volume) * 100, 1)) # Calculate the percentage

# Horizontal bar chart for the percentage of total volume by cold drink channel
ggplot(data_summary, aes(x = Total_Volume / 1e6, y = reorder(COLD_DRINK_CHANNEL, Total_Volume), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.7) +
  geom_text(aes(label = paste(Percentage, "%")),
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Percentage of Total Volume (Gallons and Cases) by Cold Drink Channel",
       x = "Quantity in Millions", y = NULL) +
  scale_x_continuous(labels = function(x) paste0(x, "M"),
                     breaks = seq(0, 10, by = 2.5),
                     expand = expansion(c(0, 0.05))) +
  geom_vline(xintercept = c(2.5, 5, 7.5, 10), color = "lightgray", linetype = "solid", linewidth = 0.3) +
  scale_fill_manual(values = cold_drink_channel_colors) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.text.x = element_text(size = 10),
        axis.title.x = element_text(size = 10, face = "bold"),
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
Dining was the segment with the highest total consumption, accounting for 27% of the total, followed by Bulk Trade with 25.8% and Workplace with 13.4%. The following section analyzes the information separately by packaging type (cases and gallons) and customer type.
Code
# Summarize data by COLD_DRINK_CHANNEL and FLEET_TYPE, excluding "CONVENTIONAL"
data_summary <- full_data_customer %>%
  filter(COLD_DRINK_CHANNEL != "CONVENTIONAL") %>% # Exclude the "CONVENTIONAL" channel
  group_by(COLD_DRINK_CHANNEL, FLEET_TYPE) %>%
  summarise(Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE) +
              sum(QTD_DLV_CA_2023, na.rm = TRUE) + sum(QTD_DLV_CA_2024, na.rm = TRUE),
            .groups = "drop") %>%
  # Percentage of each fleet type within each cold drink channel
  group_by(COLD_DRINK_CHANNEL) %>%
  mutate(Percentage = Total_Volume / sum(Total_Volume) * 100) %>%
  ungroup()

# Create the horizontal bar plot
ggplot(data_summary, aes(x = Total_Volume, y = reorder(COLD_DRINK_CHANNEL, Total_Volume), fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.7) +
  geom_text(aes(label = paste0(round(Percentage), "%")), # Rounded percentages, no decimal places
            position = position_stack(vjust = 0.5), hjust = 0.4, color = "black", size = 3.2) +
  labs(title = "400 Gallons Threshold - Total Volume by Cold Drink Channel",
       x = "Total Volume (in Millions)", y = NULL) +
  scale_x_continuous(labels = scales::comma_format(scale = 1e-6, suffix = "M"), # Axis in millions
                     breaks = seq(2500000, 10000000, by = 2500000)) +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) + # Custom color palette
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_text(size = 10),
        axis.text.x = element_text(size = 10),
        legend.position = "bottom",  # Position legend below the plot
        legend.box = "horizontal",   # Display legend items horizontally
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  # Vertical reference lines at the x-axis breaks
  geom_vline(xintercept = c(2500000, 5000000, 7500000, 10000000), color = "gray", linetype = "solid", size = 0.5)
Code
# Summarize data by COLD_DRINK_CHANNEL and FLEET_TYPE
data_summary <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL, FLEET_TYPE) %>%
  summarise(Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE) +
              sum(QTD_DLV_CA_2023, na.rm = TRUE) + sum(QTD_DLV_CA_2024, na.rm = TRUE),
            .groups = "drop") %>%
  # Percentage of each fleet type within each cold drink channel
  group_by(COLD_DRINK_CHANNEL) %>%
  mutate(Percentage = Total_Volume / sum(Total_Volume) * 100) %>%
  ungroup()

# Create the table
#kable(data_summary, format = "markdown", digits = 1, caption = "Total Volume and Percentage by Cold Drink Channel and Fleet Type")
Above are the percentage representations of the volume that would be served by red and white trucks for the 400-gallon threshold. The majority of the volumes would be delivered by red trucks. The “CONVENTIONAL” segment was not displayed due to its extremely low volume, which would overlap with the labels. In this segment, the proportion is 47% for white trucks and 53% for red trucks.
4.5.1 Cold Drink Channel - Delivered Cases for All Customers
Below are the percentages of cases delivered in 2023 and 2024 for all customers by cold drink channel.
Code
# Summarize data by COLD_DRINK_CHANNEL, summing the case quantities (QTD_DLV_CA_2023 and QTD_DLV_CA_2024)
data_summary_cases <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Cases = sum(QTD_DLV_CA_2023, na.rm = TRUE) + sum(QTD_DLV_CA_2024, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Cases / sum(Total_Cases) * 100, 1))  # Calculate the percentage

# Create a bar chart for the percentage of total cases by cold drink channel
ggplot(data_summary_cases, aes(x = Total_Cases, y = reorder(COLD_DRINK_CHANNEL, Total_Cases), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +
  geom_text(aes(label = paste(Percentage, "%")),
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "All Customers - Percentage of Cases (23 & 24) by Cold Drink Channel",
       x = "Percentage of Total Cases", y = NULL) +
  scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),  # Remove the x-axis title
        axis.text.x = element_blank(),   # Remove the x-axis text
        legend.position = "none",        # Remove the legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
Code
# Summarize data by COLD_DRINK_CHANNEL, summing the delivery cost for cases (COST_CA_23 and COST_CA_24)
data_summary_cases_cost <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Cases_Cost = sum(COST_CA_23, na.rm = TRUE) + sum(COST_CA_24, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Cases_Cost / sum(Total_Cases_Cost) * 100, 1))  # Calculate the percentage

# Create a bar chart for the percentage of total cases cost by cold drink channel
ggplot(data_summary_cases_cost, aes(x = Total_Cases_Cost, y = reorder(COLD_DRINK_CHANNEL, Total_Cases_Cost), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +
  geom_text(aes(label = paste(Percentage, "%")),
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "All Customers - Percentage of Cases Delivery Cost (23 & 24) by Cold Drink Channel",
       x = "Percentage of Total Cases Cost", y = NULL) +
  scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),  # Remove the x-axis title
        axis.text.x = element_blank(),   # Remove the x-axis text
        legend.position = "none",        # Remove the legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
The main segment receiving cases (bottles, cans, etc.) was Bulk Trade with 33%, followed by Workplace with 17% and Dining with 14.6%. The segment with the highest delivery costs for cases, however, was Dining, accounting for 34% of the 2023-2024 cost, followed by Goods at 21% and Bulk Trade at 16%.
The tables below aim to provide detailed information within this group regarding the number of customers, costs, quartile divisions, and other relevant factors.
Code
# Calculate Total Cases, COST_CA, N.Customers, Perct.Customers, AVG_Qtd, and Median.Qtd, then order and format the table
full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Cases = sum(QTD_DLV_CA_2023 + QTD_DLV_CA_2024),
    COST_CA = sum(COST_CA_23) + sum(COST_CA_24),
    # Count only customers where Total_Cases > 0
    N_Customers = n_distinct(CUSTOMER_NUMBER[QTD_DLV_CA_2023 + QTD_DLV_CA_2024 > 0]),
    # Keep per-customer cases and costs for the median calculations below
    Total_Cases_Per_Customer = list(QTD_DLV_CA_2023 + QTD_DLV_CA_2024),
    Total_Cost_Per_Customer = list(COST_CA_23 + COST_CA_24),
    .groups = 'drop'
  ) %>%
  mutate(
    AVG_Cost_CA = COST_CA / Total_Cases,                               # Average cost per case
    PERCT_CASE = round(Total_Cases / sum(Total_Cases) * 100, 1),       # Share of total cases
    Perct_Customers = round(N_Customers / sum(N_Customers) * 100, 1),  # Share of customers
    AVG_Qtd = round(Total_Cases / N_Customers),                        # Average cases per customer, no decimals
    # Median cases per customer, excluding customers with zero cases
    Median_Qtd = sapply(Total_Cases_Per_Customer, function(x) {
      median(x[x > 0], na.rm = TRUE)
    }),
    # Median cost per case for each channel, excluding customers with zero cases
    Median_Cost = sapply(seq_along(Total_Cases_Per_Customer), function(i) {
      total_cost <- Total_Cost_Per_Customer[[i]]
      total_cases <- Total_Cases_Per_Customer[[i]]
      median(total_cost[total_cases > 0] / total_cases[total_cases > 0], na.rm = TRUE)
    })
  ) %>%
  # Order by Total Cases in descending order (before formatting)
  arrange(desc(Total_Cases)) %>%
  # Opt_Cost per channel: minimum median delivery cost for CASES
  left_join(
    cost_data %>%
      filter(grepl("CASES", as.character(`RANGE_LEVEL`))) %>%
      group_by(COLD_DRINK_CHANNEL) %>%
      summarise(Opt_Cost = round(min(`Median Delivery Cost`), 2)) %>%
      ungroup(),  # Ensures only one line per COLD_DRINK_CHANNEL
    by = "COLD_DRINK_CHANNEL"
  ) %>%
  # Format the numeric columns after ordering
  mutate(
    COST_CA = scales::comma(COST_CA),
    Total_Cases = scales::comma(Total_Cases),
    AVG_Cost_CA = scales::comma(AVG_Cost_CA, accuracy = 0.01),
    N_Customers = scales::comma(N_Customers),
    PERCT_CASE = sprintf("%.1f", PERCT_CASE),
    Perct_Customers = sprintf("%.1f", Perct_Customers),
    AVG_Qtd = scales::comma(AVG_Qtd),
    Median_Qtd = scales::comma(Median_Qtd),
    Median_Cost = scales::comma(Median_Cost, accuracy = 0.01),
    Opt_Cost = scales::comma(Opt_Cost, accuracy = 0.01)
  ) %>%
  # Select columns in the correct order
  dplyr::select(
    COLD_DRINK_CHANNEL, Total_Cases, PERCT_CASE, COST_CA, N_Customers, Perct_Customers,
    AVG_Qtd, Median_Qtd, AVG_Cost_CA, Median_Cost, Opt_Cost
  ) %>%
  # Rename columns to match the desired output
  rename(
    `Channel` = COLD_DRINK_CHANNEL, `T.Cases` = Total_Cases, `Cases %` = PERCT_CASE,
    `T.Cost $` = COST_CA, `N.Cust` = N_Customers, `P.Cust %` = Perct_Customers,
    `Avg.Qtd.Cust` = AVG_Qtd, `Median.Qtd.Cust` = Median_Qtd,
    `Avg.Cost.Cust $` = AVG_Cost_CA, `Med.Cost.Cust $` = Median_Cost, `Opt.Cost $` = Opt_Cost
  ) %>%
  kable("html", escape = FALSE, align = "c",
        col.names = c("Channel", "T.Cases", "Cases %", "T.Cost $", "N.Cust", "P.Cust %",
                      "Avg.Qtd.Cust", "Median.Qtd.Cust", "Avg.Cost.Cust $", "Med.Cost.Cust $", "Opt.Cost $")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:11, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#ADD8E6") %>%
  add_header_above(c("CASES (23 & 24) - Deliveries by Cold Drink Channel - All Customers" = 11)) %>%
  kable_paper("striped", full_width = FALSE)
CASES (23 & 24) - Deliveries by Cold Drink Channel - All Customers

| Channel | T.Cases | Cases % | T.Cost $ | N.Cust | P.Cust % | Avg.Qtd.Cust | Median.Qtd.Cust | Avg.Cost.Cust $ | Med.Cost.Cust $ | Opt.Cost $ |
|---|---|---|---|---|---|---|---|---|---|---|
| BULK TRADE | 8,687,959 | 32.9 | 8,127,990 | 1,278 | 5.3 | 6,798 | 1,239 | 0.94 | 3.53 | 0.73 |
| WORKPLACE | 4,567,596 | 17.3 | 2,299,625 | 712 | 2.9 | 6,415 | 164 | 0.50 | 8.06 | 0.37 |
| DINING | 3,859,778 | 14.6 | 17,429,159 | 10,929 | 45.2 | 353 | 82 | 4.52 | 8.59 | 2.05 |
| GOODS | 3,494,064 | 13.2 | 10,780,042 | 5,542 | 22.9 | 630 | 205 | 3.09 | 7.33 | 1.09 |
| EVENT | 2,796,241 | 10.6 | 5,677,840 | 2,785 | 11.5 | 1,004 | 230 | 2.03 | 5.59 | 1.17 |
| PUBLIC SECTOR | 1,422,915 | 5.4 | 2,805,085 | 1,411 | 5.8 | 1,008 | 244 | 1.97 | 4.97 | 1.07 |
| WELLNESS | 903,700 | 3.4 | 1,529,502 | 340 | 1.4 | 2,658 | 532 | 1.69 | 4.61 | 1.23 |
| ACCOMMODATION | 695,490 | 2.6 | 2,507,698 | 1,150 | 4.8 | 605 | 326 | 3.61 | 5.54 | 1.48 |
| CONVENTIONAL | 6,337 | 0.0 | 73,937 | 53 | 0.2 | 120 | 64 | 11.67 | 14.20 | 5.34 |
Code
# Calculate Quartiles, Customer Count, and Volume Distribution
full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    # Store the total cases per customer
    Cases_Per_Customer = list(QTD_DLV_CA_2023 + QTD_DLV_CA_2024),
    .groups = 'drop'
  ) %>%
  mutate(
    # Average and median cases per customer, excluding zero values
    `Avg.Qtd.Cust` = sapply(Cases_Per_Customer, function(x) mean(x[x > 0])),
    `Median.Qtd.Cust` = sapply(Cases_Per_Customer, function(x) median(x[x > 0])),
    # Quartile cut points for quantity
    `1Quart.Qtd` = sapply(Cases_Per_Customer, function(x) quantile(x[x > 0], 0.25)),
    `2Quart.Qtd` = sapply(Cases_Per_Customer, function(x) quantile(x[x > 0], 0.50)),  # Median (Q2)
    `3Quart.Qtd` = sapply(Cases_Per_Customer, function(x) quantile(x[x > 0], 0.75))
  ) %>%
  rowwise() %>%  # Row-wise so each channel uses its own quartile cut points
  mutate(
    # Extract case values from the list
    Case_Values = list(unlist(Cases_Per_Customer)),
    # Total cases volume within each quartile bucket
    `1Quart.Vol` = sum(Case_Values[which(Case_Values > 0 & Case_Values <= `1Quart.Qtd`)]),
    `2Quart.Vol` = sum(Case_Values[which(Case_Values > `1Quart.Qtd` & Case_Values <= `2Quart.Qtd`)]),
    `3Quart.Vol` = sum(Case_Values[which(Case_Values > `2Quart.Qtd` & Case_Values <= `3Quart.Qtd`)]),
    `4Quart.Vol` = sum(Case_Values[which(Case_Values > `3Quart.Qtd`)]),
    # Total volume across the four quartiles in each channel
    Total_Vol = `1Quart.Vol` + `2Quart.Vol` + `3Quart.Vol` + `4Quart.Vol`,
    # Percentages based on the sum of volumes from all quartiles for each channel
    `1Q.Vol%` = round((`1Quart.Vol` / Total_Vol) * 100, 1),
    `2Q.Vol%` = round((`2Quart.Vol` / Total_Vol) * 100, 1),
    `3Q.Vol%` = round((`3Quart.Vol` / Total_Vol) * 100, 1),
    `4Q.Vol%` = round((`4Quart.Vol` / Total_Vol) * 100, 1)
  ) %>%
  ungroup() %>%  # Remove row-wise grouping
  # Order by Avg.Qtd.Cust in descending order
  arrange(desc(`Avg.Qtd.Cust`)) %>%
  # Format numbers for readability
  mutate(
    across(c(`Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`,
             `1Quart.Vol`, `2Quart.Vol`, `3Quart.Vol`, `4Quart.Vol`),
           ~ scales::comma(.x, accuracy = 1)),
    across(c(`1Q.Vol%`, `2Q.Vol%`, `3Q.Vol%`, `4Q.Vol%`),
           ~ formatC(.x, format = "f", digits = 1))
  ) %>%
  # Select only the required columns
  dplyr::select(
    COLD_DRINK_CHANNEL, `Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`,
    `1Quart.Vol`, `1Q.Vol%`, `2Quart.Vol`, `2Q.Vol%`, `3Quart.Vol`, `3Q.Vol%`, `4Quart.Vol`, `4Q.Vol%`
  ) %>%
  rename(`Channel` = COLD_DRINK_CHANNEL) %>%
  kable("html", escape = FALSE, align = "c",
        col.names = c("Channel", "Avg.Qtd.Cust", "Median.Qtd.Cust", "1Quart.Qtd", "2Quart.Qtd",
                      "3Quart.Qtd", "1Quart.Vol", "1Q.Vol%", "2Quart.Vol", "2Q.Vol%",
                      "3Quart.Vol", "3Q.Vol%", "4Quart.Vol", "4Q.Vol%")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:14, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#ADD8E6") %>%
  add_header_above(c("CASES (23 & 24) - Quartile Analysis by Cold Drink Channel - All Customers" = 14)) %>%
  kable_paper("striped", full_width = FALSE)
CASES (23 & 24) - Quartile Analysis by Cold Drink Channel - All Customers

| Channel | Avg.Qtd.Cust | Median.Qtd.Cust | 1Quart.Qtd | 2Quart.Qtd | 3Quart.Qtd | 1Quart.Vol | 1Q.Vol% | 2Quart.Vol | 2Q.Vol% | 3Quart.Vol | 3Q.Vol% | 4Quart.Vol | 4Q.Vol% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BULK TRADE | 6,798 | 1,239 | 384 | 1,239 | 4,162 | 53,070 | 0.6 | 236,734 | 2.7 | 743,904 | 8.6 | 7,654,251 | 88.1 |
| WORKPLACE | 6,415 | 164 | 40 | 164 | 546 | 2,622 | 0.1 | 16,174 | 0.4 | 54,740 | 1.2 | 4,494,060 | 98.4 |
| WELLNESS | 2,658 | 532 | 101 | 532 | 2,400 | 3,224 | 0.4 | 22,891 | 2.5 | 99,518 | 11.0 | 778,067 | 86.1 |
| PUBLIC SECTOR | 1,008 | 244 | 68 | 244 | 760 | 10,971 | 0.8 | 48,995 | 3.4 | 162,280 | 11.4 | 1,200,669 | 84.4 |
| EVENT | 1,004 | 230 | 56 | 230 | 753 | 17,174 | 0.6 | 89,181 | 3.2 | 296,563 | 10.6 | 2,393,323 | 85.6 |
| GOODS | 630 | 205 | 98 | 205 | 466 | 68,717 | 2.0 | 206,577 | 5.9 | 414,603 | 11.9 | 2,804,166 | 80.3 |
| ACCOMMODATION | 605 | 326 | 99 | 326 | 668 | 12,725 | 1.8 | 57,122 | 8.2 | 138,511 | 19.9 | 487,132 | 70.0 |
| DINING | 353 | 82 | 16 | 82 | 318 | 17,073 | 0.4 | 116,307 | 3.0 | 479,779 | 12.4 | 3,246,620 | 84.1 |
| CONVENTIONAL | 120 | 64 | 26 | 64 | 138 | 228 | 3.6 | 583 | 9.2 | 1,196 | 18.9 | 4,330 | 68.3 |
The tables above support a range of further analyses that are not discussed here. It is worth highlighting that the bulk trade channel has a high number of outliers, which push its average volume per customer (6,798 cases) to roughly five times its median (1,239). The same distortion appears in its delivery costs.
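The gap between average and median in the bulk trade figures is the classic signature of a right-skewed distribution. A minimal base-R illustration with made-up case counts (many small customers plus a few very large ones):

```r
# Made-up case counts: 90 small customers and 10 very large ones,
# mimicking the right-skewed distribution seen in the bulk trade channel.
cases <- c(rep(500, 90), rep(50000, 10))

mean(cases)    # 5450 - pulled far upward by the 10 outliers
median(cases)  # 500  - stays at the typical customer's volume
```

Because a handful of accounts carry most of the volume, channel-level averages overstate what a typical customer orders; medians (and the quartile tables above) give a truer picture for fleet-allocation decisions.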
4.5.2 Cold Drink Channel - Delivered Gallons for All Customers
Code
# Summarize data by COLD_DRINK_CHANNEL, summing the gallon quantities (QTD_DLV_GAL_2023 and QTD_DLV_GAL_2024)
data_summary_gallons <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Gallons / sum(Total_Gallons) * 100, 1))  # Calculate the percentage

# Create a bar chart for the percentage of total gallons by cold drink channel
ggplot(data_summary_gallons, aes(x = Total_Gallons, y = reorder(COLD_DRINK_CHANNEL, Total_Gallons), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +
  geom_text(aes(label = paste(Percentage, "%")),
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "All Customers - Percentage of Gallons (23 & 24) by Cold Drink Channel",
       x = "Percentage of Total Gallons", y = NULL) +
  scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),  # Remove the x-axis title
        axis.text.x = element_blank(),   # Remove the x-axis text
        legend.position = "none",        # Remove the legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
Code
# Summarize data by COLD_DRINK_CHANNEL, summing the delivery cost for gallons (COST_GAL_23 and COST_GAL_24)
data_summary_gallons_cost <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons_Cost = sum(COST_GAL_23, na.rm = TRUE) + sum(COST_GAL_24, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Gallons_Cost / sum(Total_Gallons_Cost) * 100, 1))  # Calculate the percentage

# Create a bar chart for the percentage of total gallons cost by cold drink channel
ggplot(data_summary_gallons_cost, aes(x = Total_Gallons_Cost, y = reorder(COLD_DRINK_CHANNEL, Total_Gallons_Cost), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +
  geom_text(aes(label = paste(Percentage, "%")),
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "All Customers - Percentage of Gallons Delivery Cost (23 & 24) by Cold Drink Channel",
       x = "Percentage of Total Gallons Cost", y = NULL) +
  scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),  # Remove the x-axis title
        axis.text.x = element_blank(),   # Remove the x-axis text
        legend.position = "none",        # Remove the legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
For gallons, the dining segment is by far the largest, accounting for 61% of the volume delivered in 2023 and 2024 and 73% of the gallon delivery cost. It is followed by event with 18.7% of volume (10% of cost) and bulk trade with 6.5% (3% of cost).
The tables below aim to provide detailed information within this group regarding the number of customers, costs, quartile divisions, and other relevant factors.
Code
# Calculate Total Gallons, COST_GAL, N.Customers, Perct.Customers, AVG_Qtd, and Median.Qtd, then order and format the table
full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons = sum(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    COST_GAL = sum(COST_GAL_23) + sum(COST_GAL_24),
    # Count only customers where Total_Gallons > 0
    N_Customers = n_distinct(CUSTOMER_NUMBER[QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024 > 0]),
    # Keep per-customer gallons and costs for the median calculations below
    Total_Gallons_Per_Customer = list(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    Total_Cost_Per_Customer = list(COST_GAL_23 + COST_GAL_24),
    .groups = 'drop'
  ) %>%
  mutate(
    AVG_Cost_GAL = COST_GAL / Total_Gallons,                           # Average cost per gallon
    PERCT_GAL = round(Total_Gallons / sum(Total_Gallons) * 100, 1),    # Share of total gallons
    Perct_Customers = round(N_Customers / sum(N_Customers) * 100, 1),  # Share of customers
    AVG_Qtd = round(Total_Gallons / N_Customers),                      # Average gallons per customer, no decimals
    # Median gallons per customer, excluding customers with zero gallons
    Median_Qtd = sapply(Total_Gallons_Per_Customer, function(x) {
      median(x[x > 0], na.rm = TRUE)
    }),
    # Median cost per gallon for each channel, excluding customers with zero gallons
    Median_Cost = sapply(seq_along(Total_Gallons_Per_Customer), function(i) {
      total_cost <- Total_Cost_Per_Customer[[i]]
      total_gallons <- Total_Gallons_Per_Customer[[i]]
      median(total_cost[total_gallons > 0] / total_gallons[total_gallons > 0], na.rm = TRUE)
    })
  ) %>%
  # Order by Total Gallons in descending order (before formatting)
  arrange(desc(Total_Gallons)) %>%
  # Opt_Cost per channel: minimum median delivery cost for GALLONS
  left_join(
    cost_data %>%
      filter(grepl("GALLONS", as.character(`RANGE_LEVEL`))) %>%
      group_by(COLD_DRINK_CHANNEL) %>%
      summarise(Opt_Cost = round(min(`Median Delivery Cost`), 2)) %>%
      ungroup(),  # Ensures only one line per COLD_DRINK_CHANNEL
    by = "COLD_DRINK_CHANNEL"
  ) %>%
  # Format the numeric columns after ordering
  mutate(
    COST_GAL = scales::comma(COST_GAL),
    Total_Gallons = scales::comma(Total_Gallons),
    AVG_Cost_GAL = scales::comma(AVG_Cost_GAL, accuracy = 0.01),
    N_Customers = scales::comma(N_Customers),
    PERCT_GAL = sprintf("%.1f", PERCT_GAL),
    Perct_Customers = sprintf("%.1f", Perct_Customers),
    AVG_Qtd = scales::comma(AVG_Qtd),
    Median_Qtd = scales::comma(Median_Qtd),
    Median_Cost = scales::comma(Median_Cost, accuracy = 0.01),
    Opt_Cost = scales::comma(Opt_Cost, accuracy = 0.01)
  ) %>%
  # Select columns in the correct order
  dplyr::select(
    COLD_DRINK_CHANNEL, Total_Gallons, PERCT_GAL, COST_GAL, N_Customers, Perct_Customers,
    AVG_Qtd, Median_Qtd, AVG_Cost_GAL, Median_Cost, Opt_Cost
  ) %>%
  # Rename columns to match the desired output
  rename(
    `Channel` = COLD_DRINK_CHANNEL, `T.Gallons` = Total_Gallons, `Gallons %` = PERCT_GAL,
    `T.Cost $` = COST_GAL, `N.Cust` = N_Customers, `P.Cust %` = Perct_Customers,
    `Avg.Qtd.Cust` = AVG_Qtd, `Median.Qtd.Cust` = Median_Qtd,
    `Avg.Cost.Cust $` = AVG_Cost_GAL, `Med.Cost.Cust $` = Median_Cost, `Opt.Cost $` = Opt_Cost
  ) %>%
  kable("html", escape = FALSE, align = "c",
        col.names = c("Channel", "T.Gallons", "Gallons %", "T.Cost $", "N.Cust", "P.Cust %",
                      "Avg.Qtd.Cust", "Median.Qtd.Cust", "Avg.Cost.Cust $", "Med.Cost.Cust $", "Opt.Cost $")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:11, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#FFCCCB") %>%
  add_header_above(c("GALLONS (23 & 24) - Deliveries by Cold Drink Channel - All Customers" = 11)) %>%
  kable_paper("striped", full_width = FALSE)
GALLONS (23 & 24) - Deliveries by Cold Drink Channel - All Customers

| Channel | T.Gallons | Gallons % | T.Cost $ | N.Cust | P.Cust % | Avg.Qtd.Cust | Median.Qtd.Cust | Avg.Cost.Cust $ | Med.Cost.Cust $ | Opt.Cost $ |
|---|---|---|---|---|---|---|---|---|---|---|
| DINING | 5,881,701 | 60.9 | 12,164,673 | 11,267 | 71.3 | 522 | 235.0 | 2.07 | 3.49 | 0.82 |
| EVENT | 1,802,976 | 18.7 | 1,711,570 | 1,473 | 9.3 | 1,224 | 287.5 | 0.95 | 2.95 | 0.39 |
| BULK TRADE | 631,817 | 6.5 | 532,474 | 464 | 2.9 | 1,362 | 347.5 | 0.84 | 2.84 | 0.39 |
| PUBLIC SECTOR | 460,586 | 4.8 | 711,092 | 635 | 4.0 | 725 | 210.0 | 1.54 | 3.73 | 0.69 |
| WORKPLACE | 258,877 | 2.7 | 436,789 | 682 | 4.3 | 380 | 177.5 | 1.69 | 3.15 | 0.41 |
| WELLNESS | 252,103 | 2.6 | 380,152 | 311 | 2.0 | 811 | 460.0 | 1.51 | 2.91 | 0.44 |
| ACCOMMODATION | 202,090 | 2.1 | 346,403 | 465 | 2.9 | 435 | 150.0 | 1.71 | 3.91 | 0.42 |
| GOODS | 165,540 | 1.7 | 417,493 | 490 | 3.1 | 338 | 182.5 | 2.52 | 4.62 | 0.69 |
| CONVENTIONAL | 4,502 | 0.0 | 43,144 | 15 | 0.1 | 300 | 135.0 | 9.58 | 19.77 | 0.72 |
Code
# Calculate Quartiles, Customer Count, and Volume Distribution
full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    # Store the total gallons per customer
    Gallons_Per_Customer = list(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    .groups = 'drop'
  ) %>%
  mutate(
    # Average and median gallons per customer, excluding zero values
    `Avg.Qtd.Cust` = sapply(Gallons_Per_Customer, function(x) mean(x[x > 0])),
    `Median.Qtd.Cust` = sapply(Gallons_Per_Customer, function(x) median(x[x > 0])),
    # Quartile cut points for quantity
    `1Quart.Qtd` = sapply(Gallons_Per_Customer, function(x) quantile(x[x > 0], 0.25)),
    `2Quart.Qtd` = sapply(Gallons_Per_Customer, function(x) quantile(x[x > 0], 0.50)),  # Median (Q2)
    `3Quart.Qtd` = sapply(Gallons_Per_Customer, function(x) quantile(x[x > 0], 0.75))
  ) %>%
  rowwise() %>%  # Row-wise so each channel uses its own quartile cut points
  mutate(
    # Extract gallon values from the list
    Gallon_Values = list(unlist(Gallons_Per_Customer)),
    # Total gallons volume within each quartile bucket
    `1Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > 0 & Gallon_Values <= `1Quart.Qtd`)]),
    `2Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > `1Quart.Qtd` & Gallon_Values <= `2Quart.Qtd`)]),
    `3Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > `2Quart.Qtd` & Gallon_Values <= `3Quart.Qtd`)]),
    `4Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > `3Quart.Qtd`)]),
    # Total volume across the four quartiles in each channel
    Total_Vol = `1Quart.Vol` + `2Quart.Vol` + `3Quart.Vol` + `4Quart.Vol`,
    # Percentages based on the sum of volumes from all quartiles for each channel
    `1Q.Vol%` = round((`1Quart.Vol` / Total_Vol) * 100, 1),
    `2Q.Vol%` = round((`2Quart.Vol` / Total_Vol) * 100, 1),
    `3Q.Vol%` = round((`3Quart.Vol` / Total_Vol) * 100, 1),
    `4Q.Vol%` = round((`4Quart.Vol` / Total_Vol) * 100, 1)
  ) %>%
  ungroup() %>%  # Remove row-wise grouping
  # Order by Avg.Qtd.Cust in descending order
  arrange(desc(`Avg.Qtd.Cust`)) %>%
  # Format numbers for readability
  mutate(
    across(c(`Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`,
             `1Quart.Vol`, `2Quart.Vol`, `3Quart.Vol`, `4Quart.Vol`),
           ~ scales::comma(.x, accuracy = 1)),
    across(c(`1Q.Vol%`, `2Q.Vol%`, `3Q.Vol%`, `4Q.Vol%`),
           ~ formatC(.x, format = "f", digits = 1))
  ) %>%
  # Select only the required columns
  dplyr::select(
    COLD_DRINK_CHANNEL, `Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`,
    `1Quart.Vol`, `1Q.Vol%`, `2Quart.Vol`, `2Q.Vol%`, `3Quart.Vol`, `3Q.Vol%`, `4Quart.Vol`, `4Q.Vol%`
  ) %>%
  rename(`Channel` = COLD_DRINK_CHANNEL) %>%
  kable("html", escape = FALSE, align = "c",
        col.names = c("Channel", "Avg.Qtd.Cust", "Median.Qtd.Cust", "1Quart.Qtd", "2Quart.Qtd",
                      "3Quart.Qtd", "1Quart.Vol", "1Q.Vol%", "2Quart.Vol", "2Q.Vol%",
                      "3Quart.Vol", "3Q.Vol%", "4Quart.Vol", "4Q.Vol%")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:14, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#FFCCCB") %>%
  add_header_above(c("GALLONS (23 & 24) - Quartile Analysis by Cold Drink Channel - All Customers" = 14)) %>%
  kable_paper("striped", full_width = FALSE)
GALLONS (23 & 24) - Quartile Analysis by Cold Drink Channel - All Customers

| Channel | Avg.Qtd.Cust | Median.Qtd.Cust | 1Quart.Qtd | 2Quart.Qtd | 3Quart.Qtd | 1Quart.Vol | 1Q.Vol% | 2Quart.Vol | 2Q.Vol% | 3Quart.Vol | 3Q.Vol% | 4Quart.Vol | 4Q.Vol% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BULK TRADE | 1,362 | 348 | 114 | 348 | 1,057 | 4,870 | 0.8 | 25,440 | 4.0 | 71,161 | 11.3 | 530,345 | 83.9 |
| EVENT | 1,224 | 288 | 110 | 288 | 800 | 20,007 | 1.1 | 69,634 | 3.9 | 182,784 | 10.1 | 1,530,550 | 84.9 |
| WELLNESS | 811 | 460 | 158 | 460 | 919 | 5,927 | 2.4 | 20,742 | 8.2 | 52,658 | 20.9 | 172,777 | 68.5 |
| PUBLIC SECTOR | 725 | 210 | 79 | 210 | 554 | 6,269 | 1.4 | 22,497 | 4.9 | 54,900 | 11.9 | 376,920 | 81.8 |
| DINING | 522 | 235 | 88 | 235 | 585 | 121,015 | 2.1 | 429,705 | 7.3 | 1,082,575 | 18.4 | 4,248,406 | 72.2 |
| ACCOMMODATION | 435 | 150 | 55 | 150 | 518 | 3,231 | 1.6 | 11,618 | 5.7 | 35,067 | 17.4 | 152,175 | 75.3 |
| WORKPLACE | 380 | 177 | 85 | 177 | 370 | 8,121 | 3.1 | 21,469 | 8.3 | 44,312 | 17.1 | 184,975 | 71.5 |
| GOODS | 338 | 182 | 82 | 182 | 359 | 5,141 | 3.1 | 16,529 | 10.0 | 30,573 | 18.5 | 113,297 | 68.4 |
| CONVENTIONAL | 300 | 135 | 105 | 135 | 182 | 390 | 8.7 | 350 | 7.8 | 700 | 15.5 | 3,062 | 68.0 |
The tables above support a range of further analyses that are not discussed here. It is worth highlighting that the dining segment averages 522 gallons per customer against a median of 235, a much smaller mean-to-median gap than the one cases show for bulk trade, so outliers distort its delivery costs less.
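The quartile columns in these tables come from bucketing each customer's total by the channel's quartile cut points and summing the volume within each bucket. A condensed base-R sketch of that calculation, using invented per-customer gallon totals for a single channel:

```r
# Invented per-customer gallon totals for one channel
vols <- c(120, 250, 300, 480, 520, 900, 1500, 4000)

q <- quantile(vols, c(0.25, 0.50, 0.75))               # quartile cut points
bucket <- cut(vols, breaks = c(0, q, Inf), labels = paste0("Q", 1:4))

# Volume delivered within each quartile bucket, and its share of the channel total
vol_by_q <- tapply(vols, bucket, sum)
round(100 * vol_by_q / sum(vols), 1)
```

Even in this tiny example, the top quartile of customers carries most of the volume, the same concentration pattern the 4Q.Vol% column shows for every channel.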
Code
# Calculate the mean and median for the "DINING" channel, without creating a permanent column
dining_gallons <- (full_data_customer$QTD_DLV_GAL_2023 + full_data_customer$QTD_DLV_GAL_2024)[
  full_data_customer$COLD_DRINK_CHANNEL == "DINING" &
    (full_data_customer$QTD_DLV_GAL_2023 + full_data_customer$QTD_DLV_GAL_2024) > 0
]
mean_value <- mean(dining_gallons, na.rm = TRUE)
median_value <- median(dining_gallons, na.rm = TRUE)

# Filter data for the "DINING" channel, exclude zero sums, and plot the histogram
full_data_customer %>%
  filter(COLD_DRINK_CHANNEL == "DINING" & (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024) > 0) %>%
  mutate(total_gallons = QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024) %>%  # Temporarily create 'total_gallons'
  ggplot(aes(x = total_gallons)) +
  # Dark gray bars with no border
  geom_histogram(binwidth = 0.5, fill = "darkgray", color = "darkgray", alpha = 0.7) +
  # Lines for mean and median
  geom_vline(aes(xintercept = mean_value, color = "Mean"), linetype = "solid", size = 0.6) +
  geom_vline(aes(xintercept = median_value, color = "Median"), linetype = "solid", size = 0.6) +
  # Customize colors and legend labels
  scale_color_manual(values = c("Mean" = "blue", "Median" = "coral"),
                     labels = c(paste("Mean:", round(mean_value, 0)),
                                paste("Median:", round(median_value, 0)))) +
  labs(title = "Total Gallons Delivered for Dining Channel",
       subtitle = "(Limited to a Maximum of 5000)",
       x = "Total Gallons Delivered",
       y = "Number of Customers") +
  xlim(0, 5000) +  # Limit x-axis to 5000
  theme_minimal() +
  theme(panel.grid.major.y = element_line(color = "gray", size = 0.5),  # Horizontal grid lines
        panel.grid = element_blank(),                                   # Remove vertical grid lines
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 14, face = "bold"),
        plot.subtitle = element_text(size = 12, face = "italic"),
        legend.position = "right",       # Move the legend to the right
        legend.title = element_blank(),  # Remove the legend title
        legend.key = element_blank())    # Remove the legend key background
4.5.3 Cold Drink Channel - Delivered Gallons for Local Market Partners Fountain Only
Code
# Summarize data by COLD_DRINK_CHANNEL, summing the gallon quantities (QTD_DLV_GAL_2023 and QTD_DLV_GAL_2024),
# and filter by LOCAL_FOUNT_ONLY == 1
data_summary_gallons <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Gallons / sum(Total_Gallons) * 100, 1))  # Calculate the percentage

# Create a bar chart for the percentage of total gallons by cold drink channel
ggplot(data_summary_gallons, aes(x = Total_Gallons, y = reorder(COLD_DRINK_CHANNEL, Total_Gallons), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +
  geom_text(aes(label = paste(Percentage, "%")),
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Local Fountain Only - Percentage of Gallons (23 & 24) by Cold Drink Channel",
       x = "Percentage of Total Gallons", y = NULL) +
  scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),  # Remove the x-axis title
        axis.text.x = element_blank(),   # Remove the x-axis text
        legend.position = "none",        # Remove the legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
Code
# # Summarize data by COLD_DRINK_CHANNEL, summing the delivery cost for gallons (COST_GAL_23 and COST_GAL_24),
# # and filter by LOCAL_FOUNT_ONLY == 1
# data_summary_gallons_cost <- full_data_customer %>%
#   filter(LOCAL_FOUNT_ONLY == 1) %>%
#   group_by(COLD_DRINK_CHANNEL) %>%
#   summarise(
#     Total_Gallons_Cost = sum(COST_GAL_23, na.rm = TRUE) + sum(COST_GAL_24, na.rm = TRUE),
#     .groups = 'drop'
#   ) %>%
#   mutate(Percentage = round(Total_Gallons_Cost / sum(Total_Gallons_Cost) * 100, 1))  # Calculate the percentage
#
# # Create a bar chart for the percentage of total gallons cost by cold drink channel for LOCAL_FOUNT_ONLY
# ggplot(data_summary_gallons_cost, aes(x = Total_Gallons_Cost, y = reorder(COLD_DRINK_CHANNEL, Total_Gallons_Cost), fill = COLD_DRINK_CHANNEL)) +
#   geom_bar(stat = "identity", position = "stack", alpha = 0.5) +
#   geom_text(aes(label = paste(Percentage, "%")),
#             position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
#   labs(title = "Local Fountain Only - Percentage of Gallons Delivery Cost (23 & 24) by Cold Drink Channel",
#        x = "Percentage of Total Gallons Cost", y = NULL) +
#   scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +
#   scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
#   theme_minimal() +
#   theme(plot.title = element_text(size = 10, face = "bold"),
#         axis.text.y = element_text(size = 10),
#         axis.title.x = element_blank(),  # Remove the x-axis title
#         axis.text.x = element_blank(),   # Remove the x-axis text
#         legend.position = "none",        # Remove the legend
#         panel.grid.major = element_blank(),
#         panel.grid.minor = element_blank())
Among local market partners buying fountain only, nearly 90% of demand comes from the dining segment, followed by event at 4.5% and workplace at 3.5%. Costs follow nearly the same proportions and are therefore not displayed.
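The LOCAL_FOUNT_ONLY flag used in these filters marks customers whose deliveries consist of fountain gallons with no cases. The derivation below is a hypothetical sketch with an invented data frame; the report's actual flag may also draw on a local-market-partner attribute not shown here.

```r
# Hypothetical sketch: flag customers with gallon (fountain) volume but no cases.
customers <- data.frame(
  CUSTOMER_NUMBER  = 1:4,
  QTD_DLV_GAL_2023 = c(300, 0, 150, 0),
  QTD_DLV_GAL_2024 = c(250, 0, 0, 80),
  QTD_DLV_CA_2023  = c(0, 120, 40, 0),
  QTD_DLV_CA_2024  = c(0, 90, 0, 0)
)

total_gal <- customers$QTD_DLV_GAL_2023 + customers$QTD_DLV_GAL_2024
total_ca  <- customers$QTD_DLV_CA_2023 + customers$QTD_DLV_CA_2024

# Fountain only: delivered gallons but zero cases across both years
customers$LOCAL_FOUNT_ONLY <- as.integer(total_gal > 0 & total_ca == 0)
customers$LOCAL_FOUNT_ONLY  # customers 1 and 4 qualify
```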
The tables below aim to provide detailed information within this group regarding the number of customers, costs, quartile divisions, and other relevant factors.
Code
# Calculate Total Gallons, COST_GAL, N.Customers, Perct.Customers, AVG_Qtd, and Median.Qtd,
# then order and format the table
full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons = sum(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    COST_GAL = sum(COST_GAL_23) + sum(COST_GAL_24),
    # Count only customers where Total_Gallons > 0
    N_Customers = n_distinct(CUSTOMER_NUMBER[QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024 > 0]),
    # Keep per-customer vectors so medians can exclude zero-volume customers
    Total_Gallons_Per_Customer = list(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    Total_Cost_Per_Customer = list(COST_GAL_23 + COST_GAL_24),
    .groups = 'drop'
  ) %>%
  mutate(
    # Average COST_GAL per Total_Gallons
    AVG_Cost_GAL = COST_GAL / Total_Gallons,
    # Percentages of total gallons and of total customers
    PERCT_GAL = round(Total_Gallons / sum(Total_Gallons) * 100, 1),
    Perct_Customers = round(N_Customers / sum(N_Customers) * 100, 1),
    # Average gallons per customer (no decimals)
    AVG_Qtd = round(Total_Gallons / N_Customers),
    # Median gallons per customer, considering only positive volumes
    Median_Qtd = sapply(Total_Gallons_Per_Customer, function(x) median(x[x > 0], na.rm = TRUE)),
    # Median cost per gallon per channel, excluding customers with zero gallons
    Median_Cost = sapply(seq_along(Total_Gallons_Per_Customer), function(i) {
      total_cost <- Total_Cost_Per_Customer[[i]]
      total_gallons <- Total_Gallons_Per_Customer[[i]]
      median(total_cost[total_gallons > 0] / total_gallons[total_gallons > 0], na.rm = TRUE)
    })
  ) %>%
  # Order by Total Gallons in descending order (before formatting)
  arrange(desc(Total_Gallons)) %>%
  # Opt_Cost per COLD_DRINK_CHANNEL: minimum median delivery cost for GALLONS
  left_join(
    cost_data %>%
      filter(grepl("GALLONS", as.character(`RANGE_LEVEL`))) %>%
      group_by(COLD_DRINK_CHANNEL) %>%
      summarise(Opt_Cost = round(min(`Median Delivery Cost`), 2)) %>%
      ungroup(),  # Ensures only 1 line per COLD_DRINK_CHANNEL
    by = "COLD_DRINK_CHANNEL"
  ) %>%
  # Format the numeric columns after ordering
  mutate(
    COST_GAL = scales::comma(COST_GAL),
    Total_Gallons = scales::comma(Total_Gallons),
    AVG_Cost_GAL = scales::comma(AVG_Cost_GAL, accuracy = 0.01),
    N_Customers = scales::comma(N_Customers),
    PERCT_GAL = sprintf("%.1f", PERCT_GAL),
    Perct_Customers = sprintf("%.1f", Perct_Customers),
    AVG_Qtd = scales::comma(AVG_Qtd),
    Median_Qtd = scales::comma(Median_Qtd),
    Median_Cost = scales::comma(Median_Cost, accuracy = 0.01),
    Opt_Cost = scales::comma(Opt_Cost, accuracy = 0.01)
  ) %>%
  # Select columns in the correct order
  dplyr::select(
    COLD_DRINK_CHANNEL, Total_Gallons, PERCT_GAL, COST_GAL, N_Customers,
    Perct_Customers, AVG_Qtd, Median_Qtd, AVG_Cost_GAL, Median_Cost, Opt_Cost
  ) %>%
  # Rename columns to match the desired output
  rename(
    `Channel` = COLD_DRINK_CHANNEL, `T.Gallons` = Total_Gallons, `Gallons %` = PERCT_GAL,
    `T.Cost $` = COST_GAL, `N.Cust` = N_Customers, `P.Cust %` = Perct_Customers,
    `Avg.Qtd.Cust` = AVG_Qtd, `Median.Qtd.Cust` = Median_Qtd,
    `Avg.Cost.Cust $` = AVG_Cost_GAL, `Med.Cost.Cust $` = Median_Cost, `Opt.Cost $` = Opt_Cost
  ) %>%
  kable("html", escape = FALSE, align = "c",
        col.names = c("Channel", "T.Gallons", "Gallons %", "T.Cost $", "N.Cust", "P.Cust %",
                      "Avg.Qtd.Cust", "Median.Qtd.Cust", "Avg.Cost.Cust $", "Med.Cost.Cust $",
                      "Opt.Cost $")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:11, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "darkorange") %>%
  add_header_above(c("GALLONS (23 & 24) - Deliveries by Cold Drink Channel - Local Fountain Only" = 11)) %>%
  kable_paper("striped", full_width = FALSE)
GALLONS (23 & 24) - Deliveries by Cold Drink Channel - Local Fountain Only

| Channel | T.Gallons | Gallons % | T.Cost $ | N.Cust | P.Cust % | Avg.Qtd.Cust | Median.Qtd.Cust | Avg.Cost.Cust $ | Med.Cost.Cust $ | Opt.Cost $ |
|---|---|---|---|---|---|---|---|---|---|---|
| DINING | 510,335 | 89.0 | 1,091,574 | 1,150 | 84.6 | 444 | 177.5 | 2.14 | 3.98 | 0.82 |
| EVENT | 25,774 | 4.5 | 50,622 | 68 | 5.0 | 379 | 121.2 | 1.96 | 3.76 | 0.39 |
| WORKPLACE | 20,029 | 3.5 | 30,279 | 63 | 4.6 | 318 | 116.6 | 1.51 | 3.15 | 0.41 |
| WELLNESS | 4,855 | 0.8 | 9,603 | 9 | 0.7 | 539 | 230.0 | 1.98 | 3.34 | 0.44 |
| GOODS | 4,759 | 0.8 | 15,484 | 25 | 1.8 | 190 | 106.7 | 3.25 | 4.62 | 0.69 |
| PUBLIC SECTOR | 3,655 | 0.6 | 12,938 | 26 | 1.9 | 141 | 81.2 | 3.54 | 3.73 | 0.69 |
| ACCOMMODATION | 3,135 | 0.5 | 8,055 | 13 | 1.0 | 241 | 102.5 | 2.57 | 3.91 | 0.42 |
| BULK TRADE | 772 | 0.1 | 1,902 | 5 | 0.4 | 154 | 125.0 | 2.46 | 2.85 | 0.39 |
Code
######## Calculate Quartiles, Customer Count, and Volume Distribution
full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    # Store the total gallons per customer
    Gallons_Per_Customer = list(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    .groups = 'drop'
  ) %>%
  mutate(
    # Average and median gallons per customer, excluding zero values
    `Avg.Qtd.Cust` = sapply(Gallons_Per_Customer, function(x) mean(x[x > 0])),
    `Median.Qtd.Cust` = sapply(Gallons_Per_Customer, function(x) median(x[x > 0])),
    # Quartiles of per-customer quantity
    `1Quart.Qtd` = sapply(Gallons_Per_Customer, function(x) quantile(x[x > 0], 0.25)),
    `2Quart.Qtd` = sapply(Gallons_Per_Customer, function(x) quantile(x[x > 0], 0.50)),  # Median (Q2)
    `3Quart.Qtd` = sapply(Gallons_Per_Customer, function(x) quantile(x[x > 0], 0.75))
  ) %>%
  rowwise() %>%  # Row-wise so volume sums use each channel's own quartile cut-offs
  mutate(
    # Extract gallon values from the list
    Gallon_Values = list(unlist(Gallons_Per_Customer)),
    # Total gallons volume per quartile
    `1Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > 0 & Gallon_Values <= `1Quart.Qtd`)]),
    `2Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > `1Quart.Qtd` & Gallon_Values <= `2Quart.Qtd`)]),
    `3Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > `2Quart.Qtd` & Gallon_Values <= `3Quart.Qtd`)]),
    `4Quart.Vol` = sum(Gallon_Values[which(Gallon_Values > `3Quart.Qtd`)]),
    # Total volume across the four quartiles in each channel
    Total_Vol = `1Quart.Vol` + `2Quart.Vol` + `3Quart.Vol` + `4Quart.Vol`,
    # Quartile shares of each channel's total volume
    `1Q.Vol%` = round((`1Quart.Vol` / Total_Vol) * 100, 1),
    `2Q.Vol%` = round((`2Quart.Vol` / Total_Vol) * 100, 1),
    `3Q.Vol%` = round((`3Quart.Vol` / Total_Vol) * 100, 1),
    `4Q.Vol%` = round((`4Quart.Vol` / Total_Vol) * 100, 1)
  ) %>%
  ungroup() %>%  # Remove row-wise grouping
  arrange(desc(`Avg.Qtd.Cust`)) %>%
  # Format numbers for readability
  mutate(
    across(c(`Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`,
             `1Quart.Vol`, `2Quart.Vol`, `3Quart.Vol`, `4Quart.Vol`),
           ~ scales::comma(.x, accuracy = 1)),
    across(c(`1Q.Vol%`, `2Q.Vol%`, `3Q.Vol%`, `4Q.Vol%`),
           ~ formatC(.x, format = "f", digits = 1))
  ) %>%
  # Select only required columns
  dplyr::select(
    COLD_DRINK_CHANNEL, `Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`,
    `1Quart.Vol`, `1Q.Vol%`, `2Quart.Vol`, `2Q.Vol%`, `3Quart.Vol`, `3Q.Vol%`, `4Quart.Vol`, `4Q.Vol%`
  ) %>%
  rename(`Channel` = COLD_DRINK_CHANNEL) %>%
  kable("html", escape = FALSE, align = "c",
        col.names = c("Channel", "Avg.Qtd.Cust", "Median.Qtd.Cust", "1Quart.Qtd", "2Quart.Qtd",
                      "3Quart.Qtd", "1Quart.Vol", "1Q.Vol%", "2Quart.Vol", "2Q.Vol%",
                      "3Quart.Vol", "3Q.Vol%", "4Quart.Vol", "4Q.Vol%")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:14, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "darkorange") %>%
  add_header_above(c("GALLONS (23 & 24) - Quartile Analysis by Cold Drink Channel - Local Fountain Only" = 14)) %>%
  kable_paper("striped", full_width = FALSE)
GALLONS (23 & 24) - Quartile Analysis by Cold Drink Channel - Local Fountain Only

| Channel | Avg.Qtd.Cust | Median.Qtd.Cust | 1Quart.Qtd | 2Quart.Qtd | 3Quart.Qtd | 1Quart.Vol | 1Q.Vol% | 2Quart.Vol | 2Q.Vol% | 3Quart.Vol | 3Q.Vol% | 4Quart.Vol | 4Q.Vol% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| WELLNESS | 539 | 230 | 100 | 230 | 998 | 142 | 2.9 | 360 | 7.4 | 1,362 | 28.1 | 2,990 | 61.6 |
| DINING | 444 | 178 | 58 | 178 | 481 | 8,472 | 1.7 | 30,716 | 6.0 | 87,949 | 17.2 | 383,197 | 75.1 |
| EVENT | 379 | 121 | 39 | 121 | 466 | 358 | 1.4 | 1,282 | 5.0 | 5,242 | 20.3 | 18,891 | 73.3 |
| WORKPLACE | 318 | 117 | 55 | 117 | 232 | 449 | 2.2 | 1,357 | 6.8 | 2,331 | 11.6 | 15,892 | 79.3 |
| ACCOMMODATION | 241 | 102 | 85 | 102 | 323 | 255 | 8.1 | 198 | 6.3 | 833 | 26.6 | 1,850 | 59.0 |
| GOODS | 190 | 107 | 30 | 107 | 222 | 152 | 3.2 | 454 | 9.5 | 860 | 18.1 | 3,292 | 69.2 |
| BULK TRADE | 154 | 125 | 82 | 125 | 235 | 87 | 11.3 | 125 | 16.2 | 235 | 30.5 | 325 | 42.1 |
| PUBLIC SECTOR | 141 | 81 | 38 | 81 | 195 | 198 | 5.4 | 325 | 8.9 | 782 | 21.4 | 2,350 | 64.3 |
The tables above support several further analyses that are not discussed here. It is worth highlighting that among the local market partners (fountain only), in the dominant DINING channel (84.6% of these customers), average consumption was 444 gallons and the median 177.5 gallons, with an average cost of $2.14 per gallon, nearly half the median per-customer cost of $3.98 per gallon.
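The gap between the $2.14 average and the $3.98 median is a weighting effect: the average divides total cost by total gallons, so high-volume, low-cost customers dominate it, while the median reflects the typical customer. A minimal sketch with invented numbers (the `gallons` and `cost` vectors below are hypothetical, not from the dataset):

```r
# Hypothetical per-customer data: one large, cheap-to-serve customer and three small ones
gallons <- c(2000, 100, 50, 40)    # gallons delivered per customer
cost    <- c(3000, 400, 220, 180)  # total delivery cost per customer

# Volume-weighted average cost per gallon: dominated by the large customer
avg_cost_per_gallon <- sum(cost) / sum(gallons)

# Median per-customer cost per gallon: reflects the typical (small) customer
median_cost_per_customer <- median(cost / gallons)

avg_cost_per_gallon       # roughly 1.74
median_cost_per_customer  # 4.2
```

The same mechanism plausibly explains the channel tables, where a few large fountain accounts pull the volume-weighted average well below the per-customer median.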
4.6 Trade Channel
Code
# Summarize data by TRADE_CHANNEL, summing the quantities of gallons and cases
data_summary_trade_channel <- full_data_customer %>%
  group_by(TRADE_CHANNEL) %>%
  summarise(
    Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE) +
                   sum(QTD_DLV_CA_2023, na.rm = TRUE) + sum(QTD_DLV_CA_2024, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Volume / sum(Total_Volume) * 100, 1))

# Define a dynamic color palette to handle more than 9 categories
num_colors <- length(unique(data_summary_trade_channel$TRADE_CHANNEL))
custom_palette <- setNames(
  colorRampPalette(brewer.pal(9, "Set2"))(num_colors),
  unique(data_summary_trade_channel$TRADE_CHANNEL)
)

# Horizontal bar chart of total volume by trade channel, labeled with percentages
ggplot(data_summary_trade_channel,
       aes(x = Total_Volume / 1e6, y = reorder(TRADE_CHANNEL, Total_Volume), fill = TRADE_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.7) +
  geom_text(aes(label = paste(Percentage, "%")), position = position_stack(vjust = 0.5),
            hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Percentage of Total Volume (Gallons and Cases) by Trade Channel",
       x = "Quantity in Millions", y = NULL) +
  scale_x_continuous(limits = c(0, 7.5), breaks = c(2.5, 5, 7.5),
                     labels = function(x) paste0(x, "M"), expand = expansion(c(0, 0.05))) +
  geom_vline(xintercept = c(2.5, 5, 7.5), color = "lightgray", linetype = "solid", linewidth = 0.3) +
  scale_fill_manual(values = custom_palette) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.text.x = element_text(size = 10),
        axis.title.x = element_text(size = 10, face = "bold"),
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
Among the trade channels, Fast Casual Dining (19%), Comprehensive Dining (13.4%), and Travel (12%) rank in the top five by total volume demand; they are also the only segments that individually account for more than 10% of the total volume.
4.7 Sub Trade Channel
The sub trade channel field contains 48 classes, so we provide an interactive reference table for lookups and queries.
Code
# Create a summary table with the frequency of each unique value in SUB_TRADE_CHANNEL
data_summary_sub_trade_channel <- profile_data %>%
  group_by(SUB_TRADE_CHANNEL) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1))

# Display the interactive table with DT
datatable(data_summary_sub_trade_channel,
          options = list(pageLength = 5, autoWidth = TRUE, dom = 'Bfrtip',
                         buttons = c('copy', 'csv', 'excel', 'pdf')))
4.8 CO2 Customers
Code
# Calculate percentages for CO2_CUSTOMER
co2_customer_summary <- profile_data %>%
  group_by(CO2_CUSTOMER) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1))

# Create the plot
ggplot(co2_customer_summary, aes(x = CO2_CUSTOMER, y = Percentage, fill = as.factor(CO2_CUSTOMER))) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  labs(title = "Percentage Breakdown by CO2 Customer Status") +
  scale_fill_manual(values = c("0" = "#8ED081", "1" = "#A7ADC6"),
                    labels = c("Non-CO2 Customers", "CO2 Customers")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.title = element_blank(),
        legend.position = "right",  # Position the legend to the right
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text.x = element_text(size = 10),
        axis.ticks.y = element_blank()) +
  scale_x_discrete(labels = c("0" = "Non-CO2 Customers", "1" = "CO2 Customers"))
Around 61% of customers do not consume CO2, including all local market partners. However, we still find that the percentage of customers consuming CO2 is relatively high, at nearly 39%.
4.9 Transactions by Cases
Code
# Summary table: per column, minimum and median over values > 0, plus totals and averages.
# The column-by-column calculations are factored into one loop over the case columns.
case_cols <- c("ORDERED_CASES", "LOADED_CASES", "DELIVERED_CASES", "RETURNED_CASES")

summary_cases <- data.frame(
  type = case_cols,
  # Minimum over values > 0, rounded to four decimal places
  min = sapply(case_cols, function(v) round(min(op_data[[v]][op_data[[v]] > 0]), 4),
               USE.NAMES = FALSE),
  # Median over values > 0
  median = sapply(case_cols, function(v) median(op_data[[v]][op_data[[v]] > 0]),
                  USE.NAMES = FALSE),
  # Maximum without decimal places
  max = sapply(case_cols, function(v) floor(max(op_data[[v]])), USE.NAMES = FALSE),
  # Sum with thousands separator
  sum_qtd = sapply(case_cols, function(v) format(sum(op_data[[v]]), big.mark = ","),
                   USE.NAMES = FALSE),
  # Number of transactions (> 0) with thousands separator
  num_trans = sapply(case_cols, function(v) format(sum(op_data[[v]] > 0), big.mark = ","),
                     USE.NAMES = FALSE),
  # Average quantity per transaction, no decimals
  avg_qtd_by_trans = sapply(case_cols,
                            function(v) round(sum(op_data[[v]]) / max(1, sum(op_data[[v]] > 0))),
                            USE.NAMES = FALSE)
)

# Create the table using kableExtra for better formatting
summary_cases %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:7, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#ADD8E6") %>%  # Light blue header
  add_header_above(c("CASES - Statistics by transactions greater than 0" = 7)) %>%
  kable_paper("striped", full_width = FALSE)
CASES - Statistics by transactions greater than 0

| type | min | median | max | sum_qtd | num_trans | avg_qtd_by_trans |
|---|---|---|---|---|---|---|
| ORDERED_CASES | 0.0898 | 11.5 | 8479 | 28,074,470 | 772,877 | 36 |
| LOADED_CASES | 0.0898 | 11.0 | 8171 | 27,103,098 | 770,624 | 35 |
| DELIVERED_CASES | 0.0001 | 11.0 | 8069 | 26,434,079 | 750,872 | 35 |
| RETURNED_CASES | 0.0390 | 8.0 | 3132 | 156,165 | 2,582 | 60 |
Considering all case transactions, the table above summarizes the key metrics. The values for ORDERED CASES, LOADED CASES, and DELIVERED CASES are similar, as expected. Some records involve quantities of less than one unit, the largest transactions exceed 8,000 cases, and the average is approximately 35 cases per transaction.
RETURNED CASES involves far fewer transactions (2,582), although the largest single return reached 3,132 cases; returns average 60 cases per transaction.
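The positive-only statistics used throughout this section can also be computed in a single pass with `dplyr` and `tidyr` rather than column by column; a sketch on a toy frame (`tx` below is invented data standing in for `op_data`):

```r
library(dplyr)
library(tidyr)

# Toy transactions standing in for op_data (invented values)
tx <- tibble(ORDERED_CASES  = c(0, 2, 10, 5),
             RETURNED_CASES = c(0, 0, 3, 1))

# Pivot to long format, keep positive quantities, then summarise once per measure
stats_by_type <- tx %>%
  pivot_longer(everything(), names_to = "type", values_to = "qty") %>%
  filter(qty > 0) %>%
  group_by(type) %>%
  summarise(min = min(qty), median = median(qty), max = max(qty),
            sum_qtd = sum(qty), num_trans = n(),
            avg_qtd_by_trans = round(sum(qty) / n()), .groups = "drop")

stats_by_type
```

The long-format approach extends to any number of measure columns without repeating the subsetting logic.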
Code
# Transform the data to long format
op_data_long <- op_data %>%
  dplyr::select(ORDERED_CASES, LOADED_CASES, DELIVERED_CASES) %>%
  pivot_longer(cols = everything(), names_to = "case_type", values_to = "count") %>%
  mutate(case_type = factor(case_type, levels = c("ORDERED_CASES", "LOADED_CASES", "DELIVERED_CASES")))

# Define border colors based on case_type
border_colors <- c("ORDERED_CASES" = "grey", "LOADED_CASES" = "lightblue", "DELIVERED_CASES" = "darkblue")

# Faceted histograms with a log10 frequency axis, limited to 5,000 cases
ggplot(op_data_long, aes(x = count)) +
  geom_histogram(binwidth = 1, aes(fill = case_type, color = case_type), alpha = 0.7) +
  facet_wrap(~case_type, scales = "free_x", nrow = 1,
             labeller = as_labeller(c("ORDERED_CASES" = "Ordered", "LOADED_CASES" = "Loaded",
                                      "DELIVERED_CASES" = "Delivered"))) +
  scale_y_continuous(trans = 'log10',
                     breaks = scales::trans_breaks("log10", function(x) 10^x),
                     labels = scales::trans_format("log10", scales::math_format(10^.x))) +
  scale_x_continuous(limits = c(0, 5000)) +  # Limit x-axis to 5000
  scale_color_manual(values = border_colors) +
  theme_minimal() +
  labs(title = "Histograms of Case Counts", x = "Case Count", y = "Frequency (Log Scale)") +
  theme(strip.background = element_blank(),
        strip.text = element_text(color = "black", size = 9),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_line(color = "grey", size = 0.5),
        axis.title = element_text(size = 7),
        axis.text = element_text(size = 6),
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        strip.text.x = element_text(size = 8, hjust = 0.5),
        legend.position = "none",
        axis.text.y = element_text(size = 7),
        axis.title.y = element_text(size = 8),
        panel.spacing = unit(1, "lines"))
Above are the histograms of case counts per transaction. The visualization is limited to 5,000 cases, with a logarithmic frequency scale for easier interpretation. The number of transactions thins out near 1,900 cases and picks up again around 2,000, which may correspond to the largest clients.
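One way to confirm that the dip near 1,900 and the rebound around 2,000 come from the data rather than from binning is to count transactions in coarse bins around that region; a sketch with an invented `counts` vector (the real check would use `op_data$DELIVERED_CASES`):

```r
# Invented case counts clustered just below and just above 2,000
counts <- c(1850, 1875, 1990, 2000, 2005, 2010, 2300)

# Tally transactions per coarse bin; intervals are right-closed, e.g. (1900, 2000]
bin_counts <- table(cut(counts, breaks = c(1800, 1900, 2000, 2100, Inf)))
bin_counts
```

Comparing the bin counts on the real data would show whether the 1,900-2,000 region genuinely holds fewer transactions than its neighbors.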
Below is the histogram of returned cases. The number of such transactions is relatively low, and quantities generally stay under 250 cases. A few transactions exceed 1,000 cases, but they are rare and were excluded to keep the chart interpretable.
Code
# Transform the data to long format for RETURNED_CASES
op_data_long_returned <- op_data %>%
  dplyr::select(RETURNED_CASES) %>%
  pivot_longer(cols = everything(), names_to = "case_type", values_to = "count") %>%
  mutate(case_type = factor(case_type, levels = c("RETURNED_CASES")))

# Define border colors for RETURNED_CASES
border_colors_returned <- c("RETURNED_CASES" = "black")

# Histogram for RETURNED_CASES with a log10 frequency axis, limited to 1,000 cases
ggplot(op_data_long_returned, aes(x = count)) +
  geom_histogram(binwidth = 1, aes(fill = case_type, color = case_type), alpha = 0.7) +
  scale_x_continuous(limits = c(0, 1000)) +  # Set max limit for x-axis
  scale_y_continuous(trans = 'log10',
                     breaks = scales::trans_breaks("log10", function(x) 10^x),
                     labels = scales::trans_format("log10", scales::math_format(10^.x))) +
  scale_color_manual(values = border_colors_returned) +
  theme_minimal() +
  labs(title = "Returned Case Counts", x = "Case Count", y = "Frequency (Log Scale)") +
  theme(strip.background = element_blank(),
        strip.text = element_text(color = "black", size = 9),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_line(color = "grey", size = 0.5),
        axis.title = element_text(size = 7),
        axis.text = element_text(size = 6),
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        strip.text.x = element_text(size = 8, hjust = 0.5),
        legend.position = "none",
        axis.text.y = element_text(size = 7),
        axis.title.y = element_text(size = 8),
        panel.spacing = unit(1, "lines"))
4.10 Transactions by Gallons
Code
# Summary table for GALLONS: per column, minimum and median over values > 0, plus totals
# and averages, using the same factored loop as for cases.
gallon_cols <- c("ORDERED_GALLONS", "LOADED_GALLONS", "DELIVERED_GALLONS", "RETURNED_GALLONS")

summary_gallons <- data.frame(
  type = gallon_cols,
  # Minimum over values > 0, rounded to four decimal places
  min = sapply(gallon_cols, function(v) round(min(op_data[[v]][op_data[[v]] > 0]), 4),
               USE.NAMES = FALSE),
  # Median over values > 0
  median = sapply(gallon_cols, function(v) median(op_data[[v]][op_data[[v]] > 0]),
                  USE.NAMES = FALSE),
  # Maximum without decimal places
  max = sapply(gallon_cols, function(v) floor(max(op_data[[v]])), USE.NAMES = FALSE),
  # Sum with no decimal places and thousands separator
  sum_qtd = sapply(gallon_cols, function(v) format(floor(sum(op_data[[v]])), big.mark = ","),
                   USE.NAMES = FALSE),
  # Number of transactions (> 0) with thousands separator
  num_trans = sapply(gallon_cols, function(v) format(sum(op_data[[v]] > 0), big.mark = ","),
                     USE.NAMES = FALSE),
  # Average quantity per transaction, no decimals
  avg_qtd_by_trans = sapply(gallon_cols,
                            function(v) round(sum(op_data[[v]]) / max(1, sum(op_data[[v]] > 0))),
                            USE.NAMES = FALSE)
)

# Create the table using kableExtra for better formatting
summary_gallons %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:7, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#FFCCCB") %>%  # Light red header
  add_header_above(c("GALLONS - Statistics by transactions greater than 0" = 7)) %>%
  kable_paper("striped", full_width = FALSE)
GALLONS - Statistics by transactions greater than 0

| type | min | median | max | sum_qtd | num_trans | avg_qtd_by_trans |
|---|---|---|---|---|---|---|
| ORDERED_GALLONS | 0.0898 | 15.0 | 2562 | 10,323,336 | 482,518 | 21 |
| LOADED_GALLONS | 0.0898 | 15.0 | 2562 | 10,042,299 | 479,599 | 21 |
| DELIVERED_GALLONS | 0.0159 | 15.0 | 2292 | 9,660,192 | 464,231 | 21 |
| RETURNED_GALLONS | 0.0156 | 7.5 | 1792 | 32,513 | 1,760 | 18 |
The values for ORDERED GALLONS, LOADED GALLONS, and DELIVERED GALLONS are similar, as expected. Some records involve quantities of less than one unit, the largest transactions exceed 2,200 gallons, and the average is approximately 21 gallons per transaction.
There are significantly fewer gallon transactions than case transactions: about 62% of the case count (482,518 versus 772,877 ordered transactions).
RETURNED GALLONS involves far fewer transactions still (1,760), although the largest single return reached 1,792 gallons; returns average 18 gallons per transaction.
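The transaction-count comparison follows directly from the two summary tables; the figures below are the ordered-transaction counts reported above:

```r
# Ordered-transaction counts from the CASES and GALLONS summary tables
gallon_trans <- 482518
case_trans   <- 772877

# Gallon transactions as a share of case transactions
round(gallon_trans / case_trans * 100)  # 62
```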
Code
# Transform the data to long format for gallons
op_data_long_gallons <- op_data %>%
  dplyr::select(ORDERED_GALLONS, LOADED_GALLONS, DELIVERED_GALLONS) %>%
  pivot_longer(cols = everything(), names_to = "gallon_type", values_to = "count") %>%
  mutate(gallon_type = factor(gallon_type,
                              levels = c("ORDERED_GALLONS", "LOADED_GALLONS", "DELIVERED_GALLONS")))

# Define border colors based on gallon_type
border_colors_gallons <- c("ORDERED_GALLONS" = "grey", "LOADED_GALLONS" = "coral",
                           "DELIVERED_GALLONS" = "darkred")

# Faceted histograms with a log10 frequency axis, limited to 1,000 gallons
ggplot(op_data_long_gallons, aes(x = count)) +
  geom_histogram(binwidth = 1, aes(fill = gallon_type, color = gallon_type), alpha = 0.7) +
  facet_wrap(~gallon_type, scales = "fixed", nrow = 1,
             labeller = as_labeller(c("ORDERED_GALLONS" = "Ordered", "LOADED_GALLONS" = "Loaded",
                                      "DELIVERED_GALLONS" = "Delivered"))) +
  scale_y_continuous(trans = 'log10',
                     breaks = scales::trans_breaks("log10", function(x) 10^x),
                     labels = scales::trans_format("log10", scales::math_format(10^.x))) +
  scale_color_manual(values = border_colors_gallons) +
  scale_x_continuous(limits = c(0, 1000)) +  # Limit the x-axis to 1000
  theme_minimal() +
  labs(title = "Histograms of Gallon Counts", x = "Gallon Count", y = "Frequency (Log Scale)") +
  theme(strip.background = element_blank(),
        strip.text = element_text(color = "black", size = 9),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_line(color = "grey", size = 0.5),
        axis.title = element_text(size = 7),
        axis.text = element_text(size = 6),
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        strip.text.x = element_text(size = 8, hjust = 0.5),
        legend.position = "none",
        axis.text.y = element_text(size = 7),
        axis.title.y = element_text(size = 8),
        panel.spacing = unit(1, "lines"))
We limited the histograms of gallon counts per transaction to 1,000 for better visualization; only a few operations exceed this limit, and the vast majority of transactions stay below 500 gallons.
Code
# Transform the data to long format for RETURNED_GALLONS
op_data_long_returned_gallons <- op_data %>%
  dplyr::select(RETURNED_GALLONS) %>%
  pivot_longer(cols = everything(), names_to = "gallon_type", values_to = "count") %>%
  mutate(gallon_type = factor(gallon_type, levels = c("RETURNED_GALLONS")))

# Define border colors for RETURNED_GALLONS
border_colors_returned_gallons <- c("RETURNED_GALLONS" = "black")

# Histogram for RETURNED_GALLONS with a log10 frequency axis, limited to 500 gallons
ggplot(op_data_long_returned_gallons, aes(x = count)) +
  geom_histogram(binwidth = 1, aes(fill = gallon_type, color = gallon_type), alpha = 0.7) +
  scale_y_continuous(trans = 'log10',
                     breaks = scales::trans_breaks("log10", function(x) 10^x),
                     labels = scales::trans_format("log10", scales::math_format(10^.x))) +
  scale_color_manual(values = border_colors_returned_gallons) +
  scale_x_continuous(limits = c(0, 500)) +  # Limit the x-axis to 500
  theme_minimal() +
  labs(title = "Returned Gallon Counts", x = "Gallon Count", y = "Frequency (Log Scale)") +
  theme(strip.background = element_blank(),
        strip.text = element_text(color = "black", size = 9),
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_line(color = "grey", size = 0.5),
        axis.title = element_text(size = 7),
        axis.text = element_text(size = 6),
        plot.title = element_text(size = 10, face = "bold", hjust = 0.5),
        strip.text.x = element_text(size = 8, hjust = 0.5),
        legend.position = "none",
        axis.text.y = element_text(size = 7),
        axis.title.y = element_text(size = 8),
        panel.spacing = unit(1, "lines"))
The number of returned gallon transactions is much lower than for cases, and the returned quantities generally do not exceed 100 gallons.
4.11 Transaction Dates Overview
Code
# Aggregate the transactions by month/year for gallons and cases delivered
op_data_monthly_delivery <- op_data %>%
  mutate(Month_Year = floor_date(TRANSACTION_DATE, "month")) %>%
  group_by(Month_Year) %>%
  summarise(Total_Delivered_Cases = sum(DELIVERED_CASES, na.rm = TRUE),
            Total_Delivered_Gallons = sum(DELIVERED_GALLONS, na.rm = TRUE))

# Reshape the data to long format for facet_wrap
op_data_long_delivery <- op_data_monthly_delivery %>%
  pivot_longer(cols = starts_with("Total_Delivered"), names_to = "Event", values_to = "Value")

# Plot both series with a shared x-axis and the same y-axis scale
ggplot(op_data_long_delivery, aes(x = Month_Year, y = Value, fill = Event)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Event, scales = "fixed", ncol = 1) +
  labs(title = "Monthly Delivered Cases and Gallons JAN 2023 - DEC 2024",
       x = "Month", y = "Total Units") +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(size = 9),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  # Show only month abbreviations (JAN, FEB, etc.) on the x-axis
  scale_x_date(labels = scales::date_format("%b"), breaks = scales::date_breaks("1 month")) +
  # Display the y-axis in full units (e.g., 1000 instead of 1K)
  scale_y_continuous(labels = scales::comma) +
  scale_fill_manual(values = c("Total_Delivered_Gallons" = "#FFCCCB",
                               "Total_Delivered_Cases" = "#ADD8E6"))
The seasonal effect of lower temperatures (OCT-MAR) is more pronounced for delivered cases than for gallons. Because both series share the same y-axis scale, the chart also highlights the large difference in volume between the two.
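The seasonal contrast can be quantified by splitting months into the colder (OCT-MAR) and warmer (APR-SEP) halves and comparing average deliveries; a sketch with an invented `monthly` series (the real comparison would use `op_data_monthly_delivery`):

```r
library(dplyr)
library(lubridate)

# Invented monthly case totals for one year, higher in the warm months
monthly <- tibble(
  Month_Year = seq(as.Date("2023-01-01"), as.Date("2023-12-01"), by = "month"),
  Total_Delivered_Cases = c(900, 950, 980, 1200, 1300, 1400, 1450, 1400, 1250, 1000, 920, 880)
)

# Average deliveries in the colder vs. warmer half of the year
seasonal <- monthly %>%
  mutate(season = if_else(month(Month_Year) %in% c(10:12, 1:3), "OCT-MAR", "APR-SEP")) %>%
  group_by(season) %>%
  summarise(avg_cases = mean(Total_Delivered_Cases), .groups = "drop")

seasonal
```

Running the same grouping separately on cases and gallons would put a number on how much stronger the seasonal swing is for cases.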
Code
# Aggregate the transactions by month/year for gallons and cases delivered
op_data_monthly_delivery <- op_data %>%
  mutate(Month_Year = floor_date(TRANSACTION_DATE, "month")) %>%
  group_by(Month_Year) %>%
  summarise(Total_Delivered_Cases = sum(DELIVERED_CASES, na.rm = TRUE),
            Total_Delivered_Gallons = sum(DELIVERED_GALLONS, na.rm = TRUE))

# Calculate the share of gallons in the combined total (gallons + cases)
op_data_monthly_delivery <- op_data_monthly_delivery %>%
  mutate(Total_Sales = Total_Delivered_Cases + Total_Delivered_Gallons,
         Percentage_Gallons = (Total_Delivered_Gallons / Total_Sales) * 100)

# Plot the monthly percentage of gallons sold
ggplot(op_data_monthly_delivery, aes(x = Month_Year, y = Percentage_Gallons)) +
  geom_bar(stat = "identity", fill = "#FFCCCB") +  # Gallons color
  labs(title = "Percentage of Gallons Sold Relative to Total Sales (23 & 24)",
       x = "Month", y = "Percentage of Gallons (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 9, angle = 0, hjust = 1),
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank()) +
  # Show month abbreviations on the x-axis
  scale_x_date(labels = scales::date_format("%b"), breaks = scales::date_breaks("1 month")) +
  # Percentage y-axis in 5% increments
  scale_y_continuous(labels = scales::percent_format(scale = 1), breaks = seq(0, 100, by = 5))
Month over month, gallons consistently account for between 20% and 25% of total delivered volume.
4.12 Retailer Consumption Quantities
Code
# Count distinct retailers
cat("Number of Retailers:", n_distinct(full_data$PRIMARY_GROUP_NUMBER), "\n")
Number of Retailers: 1021
Code
# Count distinct stores
cat("Number of Outlets/Stores:", n_distinct(full_data$CUSTOMER_NUMBER), "\n")
Number of Outlets/Stores: 30320
Of the 30,320 stores, many belong to the same chains: the dataset contains 1,021 distinct group numbers, one of which (PRIMARY_GROUP_NUMBER = 0) marks independent single stores, leaving 1,020 retailer networks.
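The split between independent stores and chain stores follows directly from this group-number convention; a sketch on an invented `customers` frame (the real classification uses `full_data`):

```r
library(dplyr)

# Invented customer list: PRIMARY_GROUP_NUMBER == 0 marks an independent single store
customers <- tibble(
  CUSTOMER_NUMBER = 1:6,
  PRIMARY_GROUP_NUMBER = c(0, 0, 101, 101, 102, 0)
)

# Classify and count each customer type
type_counts <- customers %>%
  mutate(customer_type = if_else(PRIMARY_GROUP_NUMBER == 0, "Single Store", "Retailer Group")) %>%
  count(customer_type)

type_counts
```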
Code
# Total deliveries by customer type (Single Store or Retailer Group)
total_delivered <- full_data %>%
  mutate(customer_type = ifelse(PRIMARY_GROUP_NUMBER == 0, "Single Store", "Retailer Group")) %>%
  group_by(customer_type) %>%
  summarise(
    qtd_cases_dlv_23 = sum(ifelse(YEAR == 2023, DELIVERED_CASES, 0), na.rm = TRUE),
    qtd_cases_dlv_24 = sum(ifelse(YEAR == 2024, DELIVERED_CASES, 0), na.rm = TRUE),
    # Gallons are also split by year so the gallons table has true yearly columns
    qtd_gallons_dlv_23 = sum(ifelse(YEAR == 2023, DELIVERED_GALLONS, 0), na.rm = TRUE),
    qtd_gallons_dlv_24 = sum(ifelse(YEAR == 2024, DELIVERED_GALLONS, 0), na.rm = TRUE),
    total_qtd_cases_dlv = sum(DELIVERED_CASES, na.rm = TRUE),
    total_qtd_gallons_dlv = sum(DELIVERED_GALLONS, na.rm = TRUE)
  ) %>%
  ungroup()

# Global totals for delivered cases and delivered gallons
total_cases <- sum(full_data$DELIVERED_CASES, na.rm = TRUE)
total_gallons <- sum(full_data$DELIVERED_GALLONS, na.rm = TRUE)

# Percentage of each group, rounded
total_delivered <- total_delivered %>%
  mutate(perc_total_qtd_cases = round((total_qtd_cases_dlv / total_cases) * 100, 0),
         perc_total_gallons = round((total_qtd_gallons_dlv / total_gallons) * 100, 0))

# Convert to data.table for efficient processing
setDT(total_delivered)

# Add a 'Total' row with global totals
total_delivered_total <- total_delivered %>%
  summarise(customer_type = "Total",
            qtd_cases_dlv_23 = sum(qtd_cases_dlv_23),
            qtd_cases_dlv_24 = sum(qtd_cases_dlv_24),
            qtd_gallons_dlv_23 = sum(qtd_gallons_dlv_23),
            qtd_gallons_dlv_24 = sum(qtd_gallons_dlv_24),
            total_qtd_cases_dlv = sum(total_qtd_cases_dlv),
            total_qtd_gallons_dlv = sum(total_qtd_gallons_dlv),
            perc_total_qtd_cases = 100,
            perc_total_gallons = 100) %>%
  as.data.table()
total_delivered <- rbind(total_delivered, total_delivered_total)

# Cases table
cases_table <- total_delivered[, .(customer_type, qtd_cases_dlv_23, qtd_cases_dlv_24,
                                   total_qtd_cases_dlv, perc_total_qtd_cases)]

# Gallons table, using the true yearly gallon sums rather than repeating the period total
gallons_table <- total_delivered[, .(customer_type, qtd_gallons_dlv_23, qtd_gallons_dlv_24,
                                     total_qtd_gallons_dlv, perc_total_gallons)]

# Combined table (cases + gallons)
total_table <- total_delivered[, .(customer_type,
                                   qtd_cas_gal_23 = qtd_cases_dlv_23 + qtd_gallons_dlv_23,
                                   qtd_cas_gal_24 = qtd_cases_dlv_24 + qtd_gallons_dlv_24,
                                   total_qtd_cas_gal = total_qtd_cases_dlv + total_qtd_gallons_dlv,
                                   perc_total_qtd = round(((total_qtd_cases_dlv + total_qtd_gallons_dlv) /
                                                            (total_cases + total_gallons)) * 100, 0))]

# Thousands separators for the numeric columns of each table
fmt <- function(x) format(x, big.mark = ",", scientific = FALSE)
cases_table[, names(cases_table)[-1] := lapply(.SD, fmt), .SDcols = names(cases_table)[-1]]
gallons_table[, names(gallons_table)[-1] := lapply(.SD, fmt), .SDcols = names(gallons_table)[-1]]
total_table[, names(total_table)[-1] := lapply(.SD, fmt), .SDcols = names(total_table)[-1]]

# Display cases
cases_table %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:5, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#ADD8E6") %>%  # Light blue header
  add_header_above(c("CASES - Statistics by deliveries greater than 0" = 5)) %>%
  kable_paper("striped", full_width = FALSE)
CASES - Statistics by deliveries greater than 0

| customer_type  | qtd_cases_dlv_23 | qtd_cases_dlv_24 | total_qtd_cases_dlv | perc_total_qtd_cases |
|----------------|------------------|------------------|---------------------|----------------------|
| Retailer Group | 10,099,875       | 10,770,367       | 20,870,242          | 79                   |
| Single Store   | 2,684,696        | 2,879,141        | 5,563,837           | 21                   |
| Total          | 12,784,571       | 13,649,508       | 26,434,079          | 100                  |
Considering cases, nearly 80% of the volume (79%) went to stores that belong to larger retailer groups.
Code
# Displays gallons
gallons_table %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:5, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#FFCCCB") %>%  # Light red header
  add_header_above(c("GALLONS - Statistics by deliveries greater than 0" = 5)) %>%
  kable_paper("striped", full_width = FALSE)
GALLONS - Statistics by deliveries greater than 0

| customer_type  | qtd_gallons_dlv_23 | qtd_gallons_dlv_24 | total_qtd_gallons_dlv | perc_total_gallons |
|----------------|--------------------|--------------------|-----------------------|--------------------|
| Retailer Group | 4,565,535          | 4,565,535          | 4,565,535             | 47                 |
| Single Store   | 5,094,657          | 5,094,657          | 5,094,657             | 53                 |
| Total          | 9,660,192          | 9,660,192          | 9,660,192             | 100                |
As for gallons, the distribution is more balanced: 53% went to single stores and 47% to retailer groups, indicating that local stores account for a larger share of gallon consumption than of cases.
Code
# Displays total (cases + gallons)
total_table %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:5, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "lightgray") %>%  # Light gray header
  add_header_above(c("TOTAL - Combined Deliveries Quantities for Cases and Gallons" = 5)) %>%
  kable_paper("striped", full_width = FALSE)
TOTAL - Combined Deliveries Quantities for Cases and Gallons

| customer_type  | qtd_cas_gal_23 | qtd_cas_gal_24 | total_qtd_cas_gal | perc_total_qtd |
|----------------|----------------|----------------|-------------------|----------------|
| Retailer Group | 14,665,410     | 15,335,902     | 25,435,777        | 70             |
| Single Store   | 7,779,354      | 7,973,798      | 10,658,494        | 30             |
| Total          | 22,444,764     | 23,309,700     | 36,094,271        | 100            |
The table below helps to better explore the data presented above.
Code
# Summarize delivered cases and gallons for 2023 and 2024
summary_2023 <- full_data %>%
  filter(YEAR == 2023) %>%
  group_by(PRIMARY_GROUP_NUMBER) %>%
  summarise(
    cas_qtd_dlv23 = sum(DELIVERED_CASES, na.rm = TRUE),
    gal_qtd_dlv23 = sum(DELIVERED_GALLONS, na.rm = TRUE)
  )

summary_2024 <- full_data %>%
  filter(YEAR == 2024) %>%
  group_by(PRIMARY_GROUP_NUMBER) %>%
  summarise(
    cas_qtd_dlv24 = sum(DELIVERED_CASES, na.rm = TRUE),
    gal_qtd_dlv24 = sum(DELIVERED_GALLONS, na.rm = TRUE)
  )

# Merge summaries and compute total values
group_demand <- full_join(summary_2023, summary_2024, by = "PRIMARY_GROUP_NUMBER") %>%
  mutate(
    across(c(cas_qtd_dlv23, gal_qtd_dlv23, cas_qtd_dlv24, gal_qtd_dlv24), ~ replace_na(., 0)),
    total_23 = cas_qtd_dlv23 + gal_qtd_dlv23,
    total_24 = cas_qtd_dlv24 + gal_qtd_dlv24,
    sum_23_24 = total_23 + total_24
  ) %>%
  rename(PGN = PRIMARY_GROUP_NUMBER) %>%
  arrange(desc(sum_23_24)) # %>%
# filter(PGN != 0) # Exclude rows where PRIMARY_GROUP_NUMBER is 0

# Convert to data.table for performance
setDT(group_demand)

# Display interactive table with formatted numbers (without changing type)
datatable(
  group_demand,
  options = list(pageLength = 10, autoWidth = TRUE),
  rownames = FALSE,
  caption = "Quantity Delivered"
) %>%
  formatCurrency(
    columns = c("cas_qtd_dlv23", "gal_qtd_dlv23", "cas_qtd_dlv24",
                "gal_qtd_dlv24", "total_23", "total_24", "sum_23_24"),
    currency = "",  # No currency symbol
    digits = 0,     # No decimal places
    mark = ","      # Thousands separator
  )
Code
# List all variables in the environment
all_vars <- ls()

# Variables to keep, including 'full_data' and 'full_data_customer'
vars_to_keep <- c("full_data", "full_data_customer", "cost_data", "customer_address",
                  "mydir", "one_seed", "op_data", "profile_data", "reference_date",
                  "custom_palette")

# Get the variables to remove
vars_to_remove <- setdiff(all_vars, vars_to_keep)

# Remove the temporary data frames
rm(list = vars_to_remove)

# Clean up by removing 'all_vars' and 'vars_to_remove'
rm(all_vars, vars_to_remove)
5. Feature Engineering
Considering all the previous analyses, the goal now is to add complementary information that can improve the robustness of the modeling process. Several feature engineering techniques were attempted, but only the most relevant ones are described here.
5.1 Census Data
The data used to enrich the location information comes from the U.S. Census Bureau, specifically the American Community Survey (ACS), which updates its estimates annually. For 2023, the ACS data, which is benchmarked to the 2020 Census, was retrieved; data for 2024 was not yet available at the time of retrieval.
Store locations were matched by coordinates rather than ZIP codes because retrieving Census data by ZIP code proved unreliable. Different stores or customers within the same ZIP code can share identical coordinates, particularly in areas such as shopping centers that host multiple businesses.
Below are the descriptions of the imported variables:
Code
# Creating the data for the table
census_data <- tibble(
  variable = c("MED_HH_INC", "GINI_IDX", "PER_CAP_INC", "MED_HOME_VAL", "POV_POP",
               "INC_LVL_1", "INC_LVL_2", "INC_LVL_3", "INC_LVL_4", "INC_LVL_5",
               "INC_LVL_6", "INC_LVL_7", "INC_LVL_8", "INC_LVL_9", "INC_LVL_10",
               "INC_LVL_11", "INC_LVL_12", "INC_LVL_13", "INC_LVL_14", "INC_LVL_15",
               "INC_LVL_16", "TOT_HOUS_UNITS", "VAC_HOUS_UNITS", "MED_GROSS_RENT",
               "BACH_DEG", "MAST_DEG", "DOC_DEG", "UNEMP_POP", "EMP_POP",
               "TOT_WORK_POP", "SNAP_HH", "MED_FAM_INC", "TOT_POP", "MALE_POP",
               "FEMALE_POP", "COMMUTE_POP", "COMMUTE_POP_DRIVE"),
  description = c("Median household income", "Gini index of income inequality",
                  "Per capita income", "Median home value", "Population below poverty",
                  "Income less than $10,000", "$10,000 to $14,999", "$15,000 to $19,999",
                  "$20,000 to $24,999", "$25,000 to $29,999", "$30,000 to $34,999",
                  "$35,000 to $39,999", "$40,000 to $44,999", "$45,000 to $49,999",
                  "$50,000 to $59,999", "$60,000 to $74,999", "$75,000 to $99,999",
                  "$100,000 to $124,999", "$125,000 to $149,999", "$150,000 to $199,999",
                  "$200,000 or more", "Total housing units", "Vacant housing units",
                  "Median gross rent", "Bachelor's degree holders", "Master's degree holders",
                  "Doctoral degree holders", "Unemployed population", "Employed population",
                  "Total working population", "Food stamp households", "Median family income",
                  "Total population", "Male population", "Female population",
                  "Total commuter population", "Total commuter population driving")
)

# Table
datatable(census_data, options = list(scrollX = TRUE, pageLength = 10),
          caption = "List of Census Variables and Descriptions")
Code
library(tidycensus)
library(sf)

# Census Bureau API key
# census_api_key(" ", install = TRUE)

# Create a copy of full_data_customer with only the relevant columns
data_sf <- full_data_customer %>%
  dplyr::select(CUSTOMER_NUMBER, LONGITUDE, LATITUDE)

# Convert customer data to sf object
data_sf <- data_sf %>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326)

# Ensure the 'census_variables' object is defined
census_variables <- tibble(
  code = c("B19013_001", "B19083_001", "B19301_001", "B25077_001", "B17001_002",
           "B19001_002", "B19001_003", "B19001_004", "B19001_005", "B19001_006",
           "B19001_007", "B19001_008", "B19001_009", "B19001_010", "B19001_011",
           "B19001_012", "B19001_013", "B19001_014", "B19001_015", "B19001_016",
           "B19001_017", "B25001_001", "B25002_003", "B25064_001", "B15003_017",
           "B15003_022", "B15003_025", "B23025_005", "B23025_004", "B24011_001",
           "B22001_002", "B19058_001", "B01003_001", "B01001_002", "B01001_026",
           "B08006_001", "B08006_002"),
  description = c("MED_HH_INC", "GINI_IDX", "PER_CAP_INC", "MED_HOME_VAL", "POV_POP",
                  "INC_LVL_1", "INC_LVL_2", "INC_LVL_3", "INC_LVL_4", "INC_LVL_5",
                  "INC_LVL_6", "INC_LVL_7", "INC_LVL_8", "INC_LVL_9", "INC_LVL_10",
                  "INC_LVL_11", "INC_LVL_12", "INC_LVL_13", "INC_LVL_14", "INC_LVL_15",
                  "INC_LVL_16", "TOT_HOUS_UNITS", "VAC_HOUS_UNITS", "MED_GROSS_RENT",
                  "BACH_DEG", "MAST_DEG", "DOC_DEG", "UNEMP_POP", "EMP_POP",
                  "TOT_WORK_POP", "SNAP_HH", "MED_FAM_INC", "TOT_POP", "MALE_POP",
                  "FEMALE_POP", "COMMUTE_POP", "COMMUTE_POP_DRIVE"),
  full_description = c("Median household income", "Gini index of income inequality",
                       "Per capita income", "Median home value", "Population below poverty",
                       "Income less than $10,000", "$10,000 to $14,999", "$15,000 to $19,999",
                       "$20,000 to $24,999", "$25,000 to $29,999", "$30,000 to $34,999",
                       "$35,000 to $39,999", "$40,000 to $44,999", "$45,000 to $49,999",
                       "$50,000 to $59,999", "$60,000 to $74,999", "$75,000 to $99,999",
                       "$100,000 to $124,999", "$125,000 to $149,999", "$150,000 to $199,999",
                       "$200,000 or more", "Total housing units", "Vacant housing units",
                       "Median gross rent", "Bachelor's degree holders", "Master's degree holders",
                       "Doctoral degree holders", "Unemployed population", "Employed population",
                       "Total working population", "Food stamp households", "Median family income",
                       "Total population", "Male population", "Female population",
                       "Total commuter population", "Total commuter population driving")
)

# Retrieve ACS data
acs_data <- get_acs(
  geography = "tract",
  variables = census_variables$code,
  year = 2023,
  state = unique(full_data_customer$STATE),
  geometry = TRUE
)

# Merge with descriptions
acs_data <- acs_data %>%
  left_join(census_variables, by = c("variable" = "code"))

# Transform CRS to match customer data
data_sf <- st_transform(data_sf, st_crs(acs_data))

# Perform spatial join
joined_data_sf <- st_join(data_sf, acs_data, join = st_intersects)

# Reshape the dataset, keeping only the 'estimate' values
census <- joined_data_sf %>%
  mutate(variable_name = if_else(variable %in% census_variables$code, description, variable)) %>%
  pivot_wider(names_from = variable_name, values_from = estimate,
              names_glue = "{variable_name}")

# Select only the required columns
census <- census %>%
  dplyr::select(CUSTOMER_NUMBER, MED_HH_INC, GINI_IDX, PER_CAP_INC, MED_HOME_VAL,
                POV_POP, INC_LVL_1, INC_LVL_2, INC_LVL_3, INC_LVL_4, INC_LVL_5,
                INC_LVL_6, INC_LVL_7, INC_LVL_8, INC_LVL_9, INC_LVL_10, INC_LVL_11,
                INC_LVL_12, INC_LVL_13, INC_LVL_14, INC_LVL_15, INC_LVL_16,
                TOT_HOUS_UNITS, VAC_HOUS_UNITS, MED_GROSS_RENT, BACH_DEG, MAST_DEG,
                DOC_DEG, UNEMP_POP, EMP_POP, TOT_WORK_POP, SNAP_HH, MED_FAM_INC,
                TOT_POP, MALE_POP, FEMALE_POP, COMMUTE_POP, COMMUTE_POP_DRIVE)

# Remove the geometry column and convert to a normal data frame
census <- census %>%
  st_drop_geometry() %>%
  as.data.frame()

# Handle missing and infinite values (replace -Inf with NA)
census[census == -Inf] <- NA

# Impute missing values with 0 (other imputation strategies could also be used)
census[is.na(census)] <- 0

# Aggregate census data by CUSTOMER_NUMBER, keeping the highest value for each column
census <- census %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(across(everything(), ~ max(.x, na.rm = TRUE)), .groups = "drop")

# Join full_data_customer and census on the CUSTOMER_NUMBER column
full_data_customer <- full_data_customer %>%
  dplyr::left_join(census, by = "CUSTOMER_NUMBER")

# Remove any duplicated columns or columns with ".x" suffixes
full_data_customer <- full_data_customer %>%
  dplyr::select(-ends_with(".x")) %>%
  dplyr::rename_with(~ gsub("\\.y$", "", .), ends_with(".y"))

# Transforming variable types before saving
full_data_customer$COLD_DRINK_CHANNEL <- as.factor(full_data_customer$COLD_DRINK_CHANNEL)
full_data_customer$TRADE_CHANNEL <- as.factor(full_data_customer$TRADE_CHANNEL)
full_data_customer$SUB_TRADE_CHANNEL <- as.factor(full_data_customer$SUB_TRADE_CHANNEL)
During the modeling process, it became clear that the absence of 2024 data limited the analysis. In addition, correlations between the census variables and, in particular, customer demand volumes were very low. Because of this, these variables were not explored further in the document. The goal is for this initial process to serve as a foundation for future analyses.
5.2 RFM Score
The RFM (Recency, Frequency, Monetary) analysis segments customers based on purchasing behavior, providing insights into consumption patterns. Adapting this model to analyze customer orders helps assess both the frequency and volume of purchases.
5.2.1 Frequency - Days Between Orders
To adapt the RFM analysis by considering purchase periods and quantities ordered, the analysis will focus on customer orders. Before calculating the number of days between orders (frequency), the total number of orders per customer will be determined, considering only those with a quantity of gallons or cases greater than 0.
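The frequency idea can be sketched in a few lines of base R. This is an illustrative toy data set, not the client data, and for brevity the low-frequency sentinel of 731 days is applied to customers with fewer than two valid orders rather than the 11-order minimum used later in the text:

```r
# Toy order history: one zero-quantity row is ignored when computing gaps
orders <- data.frame(
  customer = c(1, 1, 1, 2, 2),
  date     = as.Date(c("2023-01-01", "2023-01-11", "2023-01-31",
                       "2023-03-01", "2023-03-16")),
  cases    = c(5, 0, 8, 3, 4)
)

# Keep only orders with volume > 0, then average the day gaps per customer
valid <- orders[orders$cases > 0, ]
avg_gap <- tapply(valid$date, valid$customer, function(d) {
  d <- sort(d)
  if (length(d) < 2) return(731)  # too few orders: low-frequency sentinel
  mean(as.numeric(diff(d)))
})
avg_gap
```

Customer 1 has valid orders 30 days apart, customer 2 has orders 15 days apart, so `avg_gap` returns 30 and 15 respectively.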
Code
# Filter valid transactions (ORDERED_CASES > 0 or ORDERED_GALLONS > 0)
valid_orders <- full_data %>%
  filter(ORDERED_CASES > 0 | ORDERED_GALLONS > 0)

# Calculate the number of orders > 0 per customer
orders_per_customer <- valid_orders %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(NUM_ORDERS = n(), .groups = "drop")

# Add the column NUM_ORDERS in full_data_customer
full_data_customer <- full_data_customer %>%
  left_join(orders_per_customer, by = "CUSTOMER_NUMBER")

# Find customers who do not meet the condition (no valid transactions)
customers_not_meeting_filter <- full_data_customer %>%
  filter(is.na(NUM_ORDERS)) %>%
  summarise(unique_customers = n_distinct(CUSTOMER_NUMBER))

# Print the number of unique customers who don't meet the filter
# print(customers_not_meeting_filter)

# Remove unnecessary intermediate data frames
rm(valid_orders, orders_per_customer, customers_not_meeting_filter)
There are 135 customers who do not have order transactions greater than zero in the dataset; for these customers, I will consider the number of delivery transactions as orders.
Code
# Filter customers with NUM_ORDERS == NA
customers_with_na_orders <- full_data_customer %>%
  filter(is.na(NUM_ORDERS)) %>%
  dplyr::select(CUSTOMER_NUMBER) %>%
  distinct()

# Filter valid delivery transactions (DELIVERED_CASES > 0 or DELIVERED_GALLONS > 0)
valid_deliveries <- full_data %>%
  filter(DELIVERED_CASES > 0 | DELIVERED_GALLONS > 0)

# Calculate the number of valid deliveries per customer with NUM_ORDERS == NA
deliveries_per_customer <- valid_deliveries %>%
  filter(CUSTOMER_NUMBER %in% customers_with_na_orders$CUSTOMER_NUMBER) %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(NUM_DELIVERIES = n()) %>%
  ungroup()

# Update NUM_ORDERS only for customers with NUM_ORDERS == NA
full_data_customer <- full_data_customer %>%
  left_join(deliveries_per_customer, by = "CUSTOMER_NUMBER") %>%
  mutate(NUM_ORDERS = if_else(is.na(NUM_ORDERS), NUM_DELIVERIES, NUM_ORDERS)) %>%
  dplyr::select(-NUM_DELIVERIES)  # Drop the temporary NUM_DELIVERIES column

# Ensure full_data has the NUM_ORDERS column with the same values as full_data_customer
full_data <- full_data %>%
  left_join(full_data_customer %>% dplyr::select(CUSTOMER_NUMBER, NUM_ORDERS),
            by = "CUSTOMER_NUMBER")

# Remove unnecessary intermediate data frames
rm(customers_with_na_orders, valid_deliveries, deliveries_per_customer)
Considering all the order transactions recorded in 2023 and 2024, each unique customer has a minimum of 1 transaction and a maximum of 392 transactions.
To better understand the consumption profile of each customer, the plot below shows the number of customers per transaction-count bin, counting only transactions where ordered cases or gallons were greater than 0. For the 135 unique customers who had no order transactions but received volume, deliveries were counted as orders.
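The binning step can be expressed more compactly with base R's `cut()`; this is a sketch on toy counts, not the real data, using the same bin edges as the chunk below:

```r
# Toy transaction counts per customer
n_orders <- c(1, 3, 7, 15, 22, 48, 120, 250, 401)

# Right-closed intervals (0,1], (1,10], ... reproduce the "1", "2-10", ... bins
bins <- cut(n_orders,
            breaks = c(0, 1, 10, 20, 30, 40, 50, 100, 200, 300, Inf),
            labels = c("1", "2-10", "11-20", "21-30", "31-40",
                       "41-50", "51-100", "101-200", "201-300", ">300"))
table(bins)
```

`cut()` keeps empty bins in the factor levels, so `table(bins)` always reports all ten bins, which makes the histogram's x-axis stable.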
Code
# Count the number of valid transactions per customer
customers_by_bin <- full_data_customer %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(transaction_count = sum(NUM_ORDERS, na.rm = TRUE), .groups = "drop") %>%
  mutate(transaction_bin = case_when(
    transaction_count == 1 ~ "1",
    transaction_count >= 2 & transaction_count <= 10 ~ "2-10",
    transaction_count >= 11 & transaction_count <= 20 ~ "11-20",
    transaction_count >= 21 & transaction_count <= 30 ~ "21-30",
    transaction_count >= 31 & transaction_count <= 40 ~ "31-40",
    transaction_count >= 41 & transaction_count <= 50 ~ "41-50",
    transaction_count >= 51 & transaction_count <= 100 ~ "51-100",
    transaction_count >= 101 & transaction_count <= 200 ~ "101-200",
    transaction_count >= 201 & transaction_count <= 300 ~ "201-300",
    transaction_count > 300 ~ ">300",
    TRUE ~ "Other"
  )) %>%
  mutate(transaction_bin = factor(transaction_bin,
                                  levels = c("1", "2-10", "11-20", "21-30", "31-40",
                                             "41-50", "51-100", "101-200", "201-300", ">300"))) %>%
  group_by(transaction_bin) %>%
  summarise(unique_customers = n_distinct(CUSTOMER_NUMBER), .groups = "drop") %>%
  arrange(transaction_bin)

# Create a bar plot resembling a histogram of unique customers per transaction bin
ggplot(customers_by_bin, aes(x = transaction_bin, y = unique_customers, fill = transaction_bin)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = unique_customers), vjust = -0.3, size = 3, color = "black") +  # Counts above bars
  scale_fill_brewer(palette = "Set3") +  # RColorBrewer's Set3 palette
  labs(title = "Number of Unique Customers by Transaction Count (Orders > 0)",
       x = "Transaction Count Bins",
       y = "Number of Unique Customers") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = 0.5, vjust = 0.5),  # Centered x-axis labels
    panel.grid.major.x = element_blank(),  # Remove vertical grid lines
    panel.grid.minor.x = element_blank(),
    axis.text = element_text(size = 9),
    axis.title = element_text(size = 10)
  )
Code
# Remove unnecessary intermediate data frames
rm(customers_by_bin)
The histogram shows that 1,218 customers have only one order transaction, making it impossible to calculate the days between orders. Additionally, 6,798 customers have between 2 and 10 orders. To ensure more reliable figures, we will consider only customers with at least 11 orders for this indicator. As a result, all customers with fewer transactions will be assigned a value of 731 days between orders, indicating low order frequency over a two-year range.
Code
# Calculate the number of days between orders for customers with NUM_ORDERS >= 11
full_data <- full_data %>%
  arrange(CUSTOMER_NUMBER, TRANSACTION_DATE) %>%  # Sort by customer and date
  group_by(CUSTOMER_NUMBER) %>%
  mutate(DAYS_BETWEEN_ORD = case_when(
    # Set DAYS_BETWEEN_ORD to 731 for customers with NUM_ORDERS <= 10
    NUM_ORDERS <= 10 ~ 731,
    # Days between orders where ORDERED_CASES or ORDERED_GALLONS > 0
    NUM_ORDERS >= 11 & (ORDERED_CASES > 0 | ORDERED_GALLONS > 0) ~
      as.numeric(difftime(TRANSACTION_DATE, lag(TRANSACTION_DATE), units = "days")),
    # Otherwise, fall back to delivery rows (DELIVERED_CASES or DELIVERED_GALLONS > 0)
    NUM_ORDERS >= 11 & !(ORDERED_CASES > 0 | ORDERED_GALLONS > 0) &
      (DELIVERED_CASES > 0 | DELIVERED_GALLONS > 0) ~
      as.numeric(difftime(TRANSACTION_DATE, lag(TRANSACTION_DATE), units = "days")),
    TRUE ~ NA_real_  # For all other cases
  )) %>%
  ungroup()

# Calculate the average days between orders per customer, rounded to the nearest integer
avg_days_per_customer <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(AVG_DAYS_BET_ORD = round(mean(DAYS_BETWEEN_ORD, na.rm = TRUE), 0)) %>%
  ungroup()

# Update full_data_customer with the average days between orders
full_data_customer <- full_data_customer %>%
  left_join(avg_days_per_customer, by = "CUSTOMER_NUMBER")

# Remove temporary variables
rm(avg_days_per_customer)
Code
# Count unique customers in each days-between-orders bin (no new column added to the dataset)
customers_by_bin <- full_data_customer %>%
  mutate(DAYS_BETWEEN_ORD_BIN = case_when(
    AVG_DAYS_BET_ORD >= 1 & AVG_DAYS_BET_ORD <= 10 ~ "1-10 days",
    AVG_DAYS_BET_ORD > 10 & AVG_DAYS_BET_ORD <= 20 ~ "11-20 days",
    AVG_DAYS_BET_ORD > 20 & AVG_DAYS_BET_ORD <= 30 ~ "21-30 days",
    AVG_DAYS_BET_ORD > 30 & AVG_DAYS_BET_ORD <= 40 ~ "31-40 days",
    AVG_DAYS_BET_ORD > 40 & AVG_DAYS_BET_ORD <= 50 ~ "41-50 days",
    AVG_DAYS_BET_ORD > 50 ~ "Above 50 days",
    TRUE ~ "One Order Only"
  )) %>%
  group_by(DAYS_BETWEEN_ORD_BIN) %>%
  summarise(unique_customers = n_distinct(CUSTOMER_NUMBER), .groups = "drop") %>%
  mutate(percentage_customers = unique_customers / sum(unique_customers) * 100) %>%
  arrange(DAYS_BETWEEN_ORD_BIN)

# Bar plot of the percentage of unique customers per days-between-orders bin
ggplot(customers_by_bin, aes(x = DAYS_BETWEEN_ORD_BIN, y = percentage_customers,
                             fill = DAYS_BETWEEN_ORD_BIN)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = scales::percent(percentage_customers / 100)),
            vjust = -0.3, size = 3) +  # Percentage labels above bars
  scale_fill_brewer(palette = "Set3") +
  labs(title = "Percentage of Unique Customers by Days Between Orders",
       x = "Days Between Orders",
       y = "Percentage of Unique Customers") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = 0.5, vjust = 0.5),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text = element_text(size = 9),
    axis.title = element_text(size = 10)
  )
Code
# Remove unnecessary intermediate data frames
rm(customers_by_bin)
Around 20% of customers had an average order interval of up to 10 days, while 44% showed an average interval of more than 30 days. Approximately 5% of customers placed only one order, making it impossible to calculate the number of days between orders.
5.2.2 Recency - Time Since Last Order
To calculate recency, I will consider the number of days between the date of the last order and 01-01-2025.
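The recency calculation itself is a one-liner; a minimal sketch against the 2025-01-01 reference date, with two hypothetical last-order dates:

```r
# Days elapsed between each customer's last order and the reference date
reference_date <- as.Date("2025-01-01")
last_orders <- as.Date(c("2024-12-20", "2024-06-30"))
recency_days <- as.numeric(difftime(reference_date, last_orders, units = "days"))
recency_days
```

A last order on 2024-12-20 yields a recency of 12 days; one on 2024-06-30 yields 185 days.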
Code
# Create the LAST_ORDER_DATE column, excluding rows where all quantity columns are zero
full_data <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  mutate(LAST_ORDER_DATE = if_else(
    (ORDERED_CASES > 0 | ORDERED_GALLONS > 0) &
      !(ORDERED_CASES == 0 & ORDERED_GALLONS == 0 &
          LOADED_CASES == 0 & LOADED_GALLONS == 0 &
          DELIVERED_CASES == 0 & DELIVERED_GALLONS == 0),
    as.character(max(TRANSACTION_DATE, na.rm = TRUE)),
    NA_character_
  )) %>%
  ungroup()
There are 5,754 transaction rows where assigning the last transaction date based on orders is not possible. For these, the date of the last delivery operation will be used as the reference date. The last two transactions, referring to return transactions, will be excluded.
Code
# For customers with LAST_ORDER_DATE as NA, use the latest TRANSACTION_DATE
# where DELIVERED_CASES or DELIVERED_GALLONS > 0
full_data <- full_data %>%
  mutate(LAST_ORDER_DATE = as.Date(LAST_ORDER_DATE)) %>%  # Convert to Date format
  group_by(CUSTOMER_NUMBER) %>%
  mutate(LAST_ORDER_DATE = if_else(
    is.na(LAST_ORDER_DATE) & (ORDERED_CASES == 0 & ORDERED_GALLONS == 0),
    as.Date(max(TRANSACTION_DATE[DELIVERED_CASES > 0 | DELIVERED_GALLONS > 0], na.rm = TRUE)),
    LAST_ORDER_DATE
  )) %>%
  ungroup()

# Remove the last 2 rows where LAST_ORDER_DATE is NA (return operations only)
full_data <- full_data %>%
  filter(!is.na(LAST_ORDER_DATE))

# Remove rows where LAST_ORDER_DATE is Inf (return operations only)
full_data <- full_data %>%
  filter(!is.infinite(LAST_ORDER_DATE))

# Reference date
reference_date <- as.Date("2025-01-01")

# Create the DAYS_AF_LAST_ORD column in full_data
full_data <- full_data %>%
  mutate(DAYS_AF_LAST_ORD = ifelse(
    !is.na(LAST_ORDER_DATE),
    as.numeric(difftime(reference_date, LAST_ORDER_DATE, units = "days")),
    NA_real_
  ))

# Aggregate to the latest LAST_ORDER_DATE and DAYS_AF_LAST_ORD per CUSTOMER_NUMBER
full_data_aggregated <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(
    LAST_ORDER_DATE = max(LAST_ORDER_DATE, na.rm = TRUE),
    DAYS_AF_LAST_ORD = max(DAYS_AF_LAST_ORD, na.rm = TRUE),
    .groups = "drop"
  )

# Join the aggregated data with full_data_customer
full_data_customer <- full_data_customer %>%
  left_join(full_data_aggregated, by = "CUSTOMER_NUMBER")

# Remove unnecessary intermediate data frames
rm(full_data_aggregated)
5.2.3 Total Quantity Ordered
As there is no access to the prices charged, and considering that they likely vary among customer types and demanded volumes, the focus will be on the quantities demanded instead of monetary values. This approach aligns with the current objective of customer segmentation.
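The quantity-based substitute for the monetary component, including the delivered-volume fallback applied in the chunk below, reduces to a simple conditional replacement. A base-R sketch with toy vectors (not the client data):

```r
# Hypothetical per-customer totals: customer B ordered nothing but received deliveries
total_ordered   <- c(A = 500, B = 0,   C = 1200)
total_delivered <- c(A = 480, B = 75,  C = 1150)

# Where total ordered is zero, fall back to total delivered
total_demand <- ifelse(total_ordered == 0, total_delivered, total_ordered)
total_demand
```

Customer B's demand becomes 75 (the delivered quantity), while A and C keep their ordered totals.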
Code
# Calculate the total ordered by customer (ORDERED_CASES + ORDERED_GALLONS)
total_ordered_per_customer <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(TOTAL_ORDERED = sum(ORDERED_CASES, na.rm = TRUE) +
              sum(ORDERED_GALLONS, na.rm = TRUE)) %>%
  ungroup()

# Add the TOTAL_ORDERED column to full_data_customer by CUSTOMER_NUMBER
full_data_customer <- full_data_customer %>%
  left_join(total_ordered_per_customer, by = "CUSTOMER_NUMBER")

# Identify customers with TOTAL_ORDERED == 0
customers_with_zero_ordered <- total_ordered_per_customer %>%
  filter(TOTAL_ORDERED == 0)

# For those customers, calculate DELIVERED_CASES + DELIVERED_GALLONS from full_data
deliveries_for_zero_orders <- full_data %>%
  filter(CUSTOMER_NUMBER %in% customers_with_zero_ordered$CUSTOMER_NUMBER) %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(DELIVERED_TOTAL = sum(DELIVERED_CASES, na.rm = TRUE) +
              sum(DELIVERED_GALLONS, na.rm = TRUE)) %>%
  ungroup()

# Merge the delivery values in, replacing a zero TOTAL_ORDERED with DELIVERED_TOTAL
total_ordered_per_customer <- total_ordered_per_customer %>%
  left_join(deliveries_for_zero_orders, by = "CUSTOMER_NUMBER") %>%
  mutate(TOTAL_ORDERED = if_else(TOTAL_ORDERED == 0, DELIVERED_TOTAL, TOTAL_ORDERED)) %>%
  dplyr::select(CUSTOMER_NUMBER, TOTAL_ORDERED)

# Add the updated TOTAL_ORDERED column to full_data_customer by CUSTOMER_NUMBER
full_data_customer <- full_data_customer %>%
  left_join(total_ordered_per_customer, by = "CUSTOMER_NUMBER")

# Remove the 'TOTAL_ORDERED.x' column and rename 'TOTAL_ORDERED.y' to 'TOTAL_ORDERED'
full_data_customer <- full_data_customer %>%
  dplyr::select(-TOTAL_ORDERED.x) %>%
  dplyr::rename(TOTAL_ORDERED = TOTAL_ORDERED.y)

# Remove unnecessary intermediate data frames
rm(total_ordered_per_customer, customers_with_zero_ordered, deliveries_for_zero_orders)
5.2.4 Adapted RFM Score
Scores were assigned to classes based on the distribution of the created variables. The total score, combined with its relative weight, formed the RFM_SCORE, which served as an additional variable for customer analysis and segmentation.
To define these scores, the quantitative distribution of each variable was used, especially considering the wide range observed in some of them. Each variable received a score from 1 to 10. In the case of frequency, two separate variables were created, and weight was given not only to the number of orders but also to the interval between them. As a result, the total score ranged from 4 to 40.
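The 1-to-10 mapping in each `case_when()` below is equivalent to a lookup over ordered cut points; this base-R sketch reproduces the VOLUME_SCORE thresholds from the chunk that follows using `findInterval()`:

```r
# Cut points taken from the VOLUME_SCORE case_when below; each threshold is
# inclusive (>=), so findInterval + 1 maps volumes to scores 1 through 10
volume_score <- function(x) {
  breaks <- c(200, 302, 400, 815, 1267, 2000, 5000, 100000, 300000)
  findInterval(x, breaks) + 1  # below 200 -> 1, at or above 300000 -> 10
}

scores <- volume_score(c(150, 302, 900, 350000))
scores
```

The four test volumes land on scores 1, 3, 5, and 10, matching the `case_when()` branches for the same values.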
Code
# Remove previously created columns
# full_data_customer <- full_data_customer %>%
#   dplyr::select(-FREQUENCY_SCORE, -RECENCY_SCORE, -VOLUME_SCORE, -RFM_SCORE)

# Create frequency scores based on NUM_ORDERS and AVG_DAYS_BET_ORD
full_data_customer <- full_data_customer %>%
  mutate(
    ORDER_FREQUENCY_SCORE = case_when(
      NUM_ORDERS >= 300 ~ 10,
      NUM_ORDERS >= 200 ~ 9,
      NUM_ORDERS >= 150 ~ 8,
      NUM_ORDERS >= 100 ~ 7,
      NUM_ORDERS >= 75 ~ 6,
      NUM_ORDERS >= 50 ~ 5,   # 3rd quartile
      NUM_ORDERS >= 35 ~ 4,   # Mean
      NUM_ORDERS >= 23 ~ 3,   # Median
      NUM_ORDERS >= 10 ~ 2,   # 1st quartile
      TRUE ~ 1
    ),
    ORDER_INTERVAL_SCORE = case_when(
      AVG_DAYS_BET_ORD <= 5 ~ 10,
      AVG_DAYS_BET_ORD <= 13 ~ 9,   # 1st quartile
      AVG_DAYS_BET_ORD <= 20 ~ 8,
      AVG_DAYS_BET_ORD <= 26 ~ 7,   # Median
      AVG_DAYS_BET_ORD <= 30 ~ 6,
      AVG_DAYS_BET_ORD <= 50 ~ 5,
      AVG_DAYS_BET_ORD <= 100 ~ 4,
      AVG_DAYS_BET_ORD <= 210 ~ 3,  # Mean
      AVG_DAYS_BET_ORD <= 300 ~ 2,
      TRUE ~ 1
    )
  )

# Create Recency Score based on DAYS_AF_LAST_ORD
full_data_customer <- full_data_customer %>%
  mutate(RECENCY_SCORE = case_when(
    DAYS_AF_LAST_ORD <= 7 ~ 10,
    DAYS_AF_LAST_ORD <= 13 ~ 9,   # 1st quartile
    DAYS_AF_LAST_ORD <= 20 ~ 8,
    DAYS_AF_LAST_ORD <= 27 ~ 7,   # Median
    DAYS_AF_LAST_ORD <= 40 ~ 6,
    DAYS_AF_LAST_ORD <= 50 ~ 5,
    DAYS_AF_LAST_ORD <= 72 ~ 4,   # Mean
    DAYS_AF_LAST_ORD <= 90 ~ 3,   # 3rd quartile
    DAYS_AF_LAST_ORD <= 180 ~ 2,  # Six months
    TRUE ~ 1
  ))

# Create Volume Score based on TOTAL_ORDERED
full_data_customer <- full_data_customer %>%
  mutate(VOLUME_SCORE = case_when(
    TOTAL_ORDERED >= 300000 ~ 10,
    TOTAL_ORDERED >= 100000 ~ 9,
    TOTAL_ORDERED >= 5000 ~ 8,
    TOTAL_ORDERED >= 2000 ~ 7,
    TOTAL_ORDERED >= 1267 ~ 6,  # Mean
    TOTAL_ORDERED >= 815 ~ 5,   # 3rd quartile
    TOTAL_ORDERED >= 400 ~ 4,   # Threshold
    TOTAL_ORDERED >= 302 ~ 3,   # Median
    TOTAL_ORDERED >= 200 ~ 2,
    TRUE ~ 1
  ))

# Overall RFM Score: sum of Recency, Order Frequency, Order Interval, and Volume scores
full_data_customer <- full_data_customer %>%
  mutate(RFM_SCORE = RECENCY_SCORE + ORDER_FREQUENCY_SCORE +
           ORDER_INTERVAL_SCORE + VOLUME_SCORE)

# Count the number of customers in each RFM_SCORE range
rfm_distribution <- full_data_customer %>%
  mutate(RFM_CATEGORY = case_when(
    RFM_SCORE <= 10 ~ "4-10",
    RFM_SCORE <= 20 ~ "11-20",
    RFM_SCORE <= 30 ~ "21-30",
    TRUE ~ "31-40"
  )) %>%
  group_by(RFM_CATEGORY) %>%
  summarise(CUSTOMER_COUNT = n(), .groups = "drop") %>%
  mutate(PERCENTAGE = CUSTOMER_COUNT / sum(CUSTOMER_COUNT) * 100)

# Reorder RFM_CATEGORY so it starts with scores between 4 and 10
rfm_distribution$RFM_CATEGORY <- factor(rfm_distribution$RFM_CATEGORY,
                                        levels = c("4-10", "11-20", "21-30", "31-40"))

# Plot the distribution of RFM scores
ggplot(rfm_distribution, aes(x = RFM_CATEGORY, y = PERCENTAGE, fill = RFM_CATEGORY)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = paste0(round(PERCENTAGE, 1), "%")), vjust = -0.3, size = 4) +
  scale_fill_brewer(palette = "Set3") +  # Use Set3 color palette
  labs(title = "Distribution of Customers by RFM Score",
       x = "RFM Score Range",
       y = "Percentage of Customers") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = 0.5),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11)
  )
Code
# Remove unnecessary intermediate data frame
rm(rfm_distribution)
The adapted RFM Score condenses several pieces of information about store consumption into a single measure. It was observed that 60% of stores have a score of up to 20 (roughly the median), 32% have scores between 21 and 30, and 8.5% have scores above 30. This suggests that only a small percentage of stores exhibit high consumption patterns.
Code
# Filter only customers where LOCAL_FOUNT_ONLY == 1
rfm_distribution_lfo <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%
  mutate(RFM_CATEGORY = case_when(
    RFM_SCORE <= 10 ~ "4-10",
    RFM_SCORE <= 20 ~ "11-20",
    RFM_SCORE <= 30 ~ "21-30",
    TRUE ~ "31-40"
  )) %>%
  group_by(RFM_CATEGORY) %>%
  summarise(CUSTOMER_COUNT = n(), .groups = "drop") %>%
  mutate(PERCENTAGE = CUSTOMER_COUNT / sum(CUSTOMER_COUNT) * 100)

# Reorder RFM_CATEGORY so it starts with scores between 4 and 10
rfm_distribution_lfo$RFM_CATEGORY <- factor(rfm_distribution_lfo$RFM_CATEGORY,
                                            levels = c("4-10", "11-20", "21-30", "31-40"))

# Plot the distribution of RFM scores for LOCAL_FOUNT_ONLY == 1
ggplot(rfm_distribution_lfo, aes(x = RFM_CATEGORY, y = PERCENTAGE, fill = RFM_CATEGORY)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = paste0(round(PERCENTAGE, 1), "%")), vjust = -0.3, size = 4) +
  scale_fill_brewer(palette = "Set3") +  # Use Set3 color palette
  labs(title = "Distribution of Customers by RFM Score (LOCAL_FOUNT_ONLY = 1)",
       x = "RFM Score Range",
       y = "Percentage of Customers") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = 0.5),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.text = element_text(size = 10),
    axis.title = element_text(size = 11)
  )
Code
# Remove unnecessary intermediate data frame
rm(rfm_distribution_lfo)
For local market partners who consume only fountain drinks, consumption patterns are even lower: nearly 74% have scores of up to 20, and only about 3.6% have scores above 30.
5.3 Customer Demand and Growth
5.3.1 Low Demand Customers
It is known that a few customers exhibit very high consumption volumes, causing the average to be skewed above the median. The table below explores metrics related to customers whose demand falls below the first quartile.
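The first-quartile cut used in the table below can be sketched in base R with `quantile()`; the volumes here are illustrative, not the client data:

```r
# Toy customer volumes for one channel
vols <- c(10, 40, 90, 120, 300, 800, 2500, 9000)

# First quartile and the customers at or below it
q1 <- quantile(vols, 0.25)
below <- vols <= q1
c(n_below = sum(below), vol_share = round(sum(vols[below]) / sum(vols) * 100, 1))
```

Two of the eight customers fall at or below the first quartile, yet their combined volume is well under 1% of the channel total, mirroring the pattern in the real table.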
Code
# Summarize the metrics
data_summary <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Avg_Vol_Cust = round(mean((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) +
                                (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), na.rm = TRUE)),
    Median_Vol_Cust = round(median((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) +
                                     (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), na.rm = TRUE)),
    First_Quartile_Vol = round(quantile((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) +
                                          (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
                                        0.25, na.rm = TRUE)),
    .groups = "drop"
  )

# Calculate the first quartile and total volume for each channel
quartile_data <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    First_Quartile_Val = round(quantile((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) +
                                          (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
                                        0.25, na.rm = TRUE)),
    Tot_Vol = sum((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) +
                    (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), na.rm = TRUE),
    .groups = "drop"
  )

# Calculate the number of customers below the first quartile and their total volume
below_quartile_stats <- full_data_customer %>%
  left_join(quartile_data %>% dplyr::select(COLD_DRINK_CHANNEL, First_Quartile_Val),
            by = "COLD_DRINK_CHANNEL") %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Num_Customers_Below_1Q = sum(((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) +
                                    (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024)) <= First_Quartile_Val,
                                 na.rm = TRUE),
    Vol_Below_1Q = sum(((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) +
                          (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024))[
                            ((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) +
                               (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024)) <= First_Quartile_Val],
                       na.rm = TRUE),
    .groups = "drop"
  )

# Count the total number of customers per channel
customer_count <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(N_Cust = n(), .groups = "drop")

# Combine all the data
final_summary <- customer_count %>%
  left_join(data_summary, by = "COLD_DRINK_CHANNEL") %>%
  left_join(quartile_data, by = "COLD_DRINK_CHANNEL") %>%
  left_join(below_quartile_stats, by = "COLD_DRINK_CHANNEL") %>%
  mutate(
    Vol_Perct = round((Vol_Below_1Q / Tot_Vol) * 100, 1),
    First_Quartile_Vol = as.integer(First_Quartile_Val)
  ) %>%
  dplyr::select(COLD_DRINK_CHANNEL, N_Cust, Avg_Vol_Cust, Median_Vol_Cust,
                First_Quartile_Vol, Num_Customers_Below_1Q, Vol_Perct)

# Display the table with kable and styling
kable(final_summary, format = "html", escape = FALSE, align = "c",
      col.names = c("Cold Drink Channel", "Total Cust.", "Avg. Vol Cust.",
                    "Median Vol Cust.", "1st Quartile Qtd",
                    "Cust. Below 1st Quart", "Vol % Below 1st Quart")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:7, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "lightyellow") %>%
  add_header_above(c("Customers Analysis by Cold Drink Channel" = 7)) %>%
  kable_paper("striped", full_width = FALSE)
Customers Analysis by Cold Drink Channel

| Cold Drink Channel | Total Cust. | Avg. Vol Cust. | Median Vol Cust. | 1st Quartile Qtd | Cust. Below 1st Quart | Vol % Below 1st Quart |
|---|---|---|---|---|---|---|
| ACCOMMODATION | 1235 | 727 | 376 | 122 | 310 | 2.0 |
| BULK TRADE | 1320 | 7060 | 1420 | 444 | 330 | 0.7 |
| CONVENTIONAL | 57 | 190 | 99 | 39 | 15 | 3.1 |
| DINING | 15400 | 633 | 283 | 98 | 3860 | 1.8 |
| EVENT | 3074 | 1496 | 329 | 91 | 771 | 0.7 |
| GOODS | 5826 | 628 | 209 | 104 | 1465 | 2.1 |
| PUBLIC SECTOR | 1736 | 1085 | 283 | 94 | 435 | 1.1 |
| WELLNESS | 479 | 2413 | 625 | 182 | 119 | 0.9 |
| WORKPLACE | 1193 | 4046 | 200 | 87 | 303 | 0.3 |
For customers whose combined 2023 and 2024 consumption volume falls below the first quartile, the total volume involved is very small, ranging from 0.3% to 3.1% of each segment's total. In the dining segment, for example, roughly 25% of customers (3,860 of 15,400) showed demand below the first quartile.
Some of these customers have nevertheless been flagged as having high growth potential, since their demand growth is above average. This happens because even a small absolute increase in demand translates into a large growth percentage for a low-volume customer.
Their low RFM scores also indicate low recency, frequency, and total purchase volume. A flag, LOW_DEMAND_CUST, will therefore be created, with a value of 1 marking low-consumption customers. Customers carrying this flag will be assigned a white truck, regardless of their growth indices.
Below are the cut volumes by segment:
Code
# Extract the list of 'Cold Drink Channel' and '1st Quartile Qty'
list_summary <- final_summary %>%
  dplyr::select(COLD_DRINK_CHANNEL, First_Quartile_Vol) %>%
  deframe()

# Display the list
list_summary
# Calculate the sum per row and assign LOW_DEMAND_CUST
full_data_customer <- full_data_customer %>%
  mutate(
    Total_Vol_Cust = (QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    LOW_DEMAND_CUST = if_else(Total_Vol_Cust <= list_summary[COLD_DRINK_CHANNEL], 1, 0)
  )
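The flag assignment relies on indexing a named vector by a character column, which looks up each row's segment threshold in a single vectorized step. A minimal sketch of the idiom, using toy data and hypothetical cut volumes:

```r
library(dplyr)

# Hypothetical per-segment cut volumes (named vector, as deframe() produces)
thresholds <- c("DINING" = 98, "EVENT" = 91)

toy <- tibble(
  COLD_DRINK_CHANNEL = c("DINING", "EVENT", "DINING"),
  Total_Vol_Cust     = c(50, 200, 120)
)

# Indexing thresholds by the character column aligns the right cut volume
# with each row before the comparison
toy <- toy %>%
  mutate(LOW_DEMAND_CUST = if_else(Total_Vol_Cust <= thresholds[COLD_DRINK_CHANNEL], 1, 0))

unname(toy$LOW_DEMAND_CUST)  # 1 0 0
```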
In the plot below, the numbers represent the percentages and the number of customers who received this flag.
Code
# Group and calculate the number of customers with LOW_DEMAND_CUST by LOCAL_FOUNT_ONLY
summary_low_demand <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, LOW_DEMAND_CUST) %>%
  summarise(total_customers = n(), .groups = "drop")

# Calculate the percentage for each group
summary_low_demand <- summary_low_demand %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(percentage = total_customers / sum(total_customers) * 100)

# Plot for percentages with LOW_DEMAND_CUST as fill and LOCAL_FOUNT_ONLY as group
ggplot(summary_low_demand, aes(x = factor(LOCAL_FOUNT_ONLY), y = percentage, fill = factor(LOW_DEMAND_CUST))) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = paste0(scales::comma(percentage, suffix = "%"), " (", total_customers, ")")),
            position = position_dodge(width = 0.8), vjust = -0.2, size = 3.5) +
  labs(title = "Percentage of Customers with Low Demand") +
  scale_fill_manual(values = c("0" = "darkolivegreen", "1" = "sandybrown"),
                    labels = c("0" = "Others (Above Q1)", "1" = "Low Demand")) +  # Set colors and labels for LOW_DEMAND_CUST
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.title = element_text(face = "bold", size = 10),  # Add legend title
    legend.position = "right",                              # Position legend on the right side
    legend.box = "vertical",                                # Ensure vertical arrangement for the legend
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 10, face = "bold"),
    strip.background = element_blank(),
    axis.text.x = element_text(size = 10, angle = 0),       # Display x-axis labels without rotation
    panel.spacing = unit(1, "lines"),
    strip.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +  # Set x-axis labels
  guides(fill = guide_legend(title = "Low Demand Status"))  # Add legend title
5.3.2 Demand Variation Across All Stores
To measure demand growth patterns across our customer base (January 2023 - December 2024):
Data Preparation: Combined monthly case and gallon deliveries for each customer into total monthly volumes.
Eligibility: Required ≥6 months with positive orders for reliable analysis. Customers with <6 ordering months were classified as having no growth potential (6,026 customers).
Growth Calculation:
Split each qualifying customer’s order history into two equal time periods
For odd numbers of months, divided the middle month equally between periods
Calculated growth rate as: (Second Period Total - First Period Total) / First Period Total
Classification: Customers with growth rates exceeding the average positive growth rate were categorized as high growth potential (HIGH_GROW_POT = 1), while all others received a standard classification (HIGH_GROW_POT = 0).
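The split-halves growth calculation above can be sketched on a toy order history (values assumed for illustration; with 7 ordering months, the middle month's volume is shared equally between the two periods):

```r
# Toy vector of monthly positive order volumes (7 months, odd count)
monthly_vol <- c(10, 12, 8, 14, 16, 18, 20)

n   <- length(monthly_vol)
mid <- (n + 1) %/% 2                     # index of the middle month

# Middle month contributes half its volume to each period
first_half  <- sum(monthly_vol[seq_len(mid - 1)]) + monthly_vol[mid] / 2
second_half <- monthly_vol[mid] / 2 + sum(monthly_vol[(mid + 1):n])

# Growth rate = (Second Period Total - First Period Total) / First Period Total
growth_rate <- (second_half - first_half) / first_half
round(growth_rate, 3)  # 0.649
```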
Code
# Initialize new columns in the dataset
full_data_customer$NUM_POSITIVE_SUMS <- 0
full_data_customer$QTD_DLV_FIRST_HALF <- 0
full_data_customer$QTD_DLV_SECOND_HALF <- 0
full_data_customer$DEMAND_VARIATION <- NA  # Initialize as NA

# Process each customer individually
for (i in 1:nrow(full_data_customer)) {
  # Create a vector of positive sums while maintaining the chronological order
  POSITIVE_SUMS <- c()

  # Iterate over the 24 months in the correct sequence
  for (j in 1:24) {
    # Create column names
    year  <- 2023 + (j - 1) %/% 12
    month <- (j - 1) %% 12 + 1
    CA_COL  <- paste0("QTD_DLV_CA_",  sprintf("%04d", year), "_", sprintf("%02d", month))
    GAL_COL <- paste0("QTD_DLV_GAL_", sprintf("%04d", year), "_", sprintf("%02d", month))

    # Check if columns exist in the dataset
    if (CA_COL %in% names(full_data_customer) && GAL_COL %in% names(full_data_customer)) {
      CA_VALUE  <- full_data_customer[[CA_COL]][i]
      GAL_VALUE <- full_data_customer[[GAL_COL]][i]

      # Replace NA with 0
      CA_VALUE  <- ifelse(is.na(CA_VALUE), 0, CA_VALUE)
      GAL_VALUE <- ifelse(is.na(GAL_VALUE), 0, GAL_VALUE)

      # Sum values for the month
      SUM_VALUE <- CA_VALUE + GAL_VALUE

      # Add to the list if positive
      if (SUM_VALUE > 0) {
        POSITIVE_SUMS <- c(POSITIVE_SUMS, SUM_VALUE)
      }
    }
  }

  # Total number of positive operations
  num_operations <- length(POSITIVE_SUMS)
  full_data_customer$NUM_POSITIVE_SUMS[i] <- num_operations

  # If fewer than 6 positive sums, set values accordingly and continue
  if (num_operations < 6) {
    full_data_customer$QTD_DLV_FIRST_HALF[i] <- 0
    full_data_customer$QTD_DLV_SECOND_HALF[i] <- 0
    full_data_customer$DEMAND_VARIATION[i] <- NA
    next
  }

  # Initialize the two halves
  QTD_DLV_FIRST_HALF <- 0
  QTD_DLV_SECOND_HALF <- 0

  # Split the operations into two halves
  if (num_operations %% 2 == 0) {
    # If even number of operations
    mid_point <- num_operations / 2
    QTD_DLV_FIRST_HALF  <- sum(POSITIVE_SUMS[1:mid_point])
    QTD_DLV_SECOND_HALF <- sum(POSITIVE_SUMS[(mid_point + 1):num_operations])
  } else {
    # If odd number of operations
    mid_point <- (num_operations + 1) %/% 2

    # Split the middle value between both halves
    first_part    <- if (mid_point > 1) POSITIVE_SUMS[1:(mid_point - 1)] else numeric(0)
    central_value <- POSITIVE_SUMS[mid_point] / 2
    second_part   <- if (mid_point < num_operations) POSITIVE_SUMS[(mid_point + 1):num_operations] else numeric(0)
    QTD_DLV_FIRST_HALF  <- sum(c(first_part, central_value))
    QTD_DLV_SECOND_HALF <- sum(c(central_value, second_part))
  }

  # Assign values to the dataset
  full_data_customer$QTD_DLV_FIRST_HALF[i]  <- QTD_DLV_FIRST_HALF
  full_data_customer$QTD_DLV_SECOND_HALF[i] <- QTD_DLV_SECOND_HALF

  # Calculate demand variation
  if (QTD_DLV_FIRST_HALF > 0) {  # Avoid division by zero
    DEMAND_VARIATION_VALUE <- (QTD_DLV_SECOND_HALF - QTD_DLV_FIRST_HALF) / QTD_DLV_FIRST_HALF
    full_data_customer$DEMAND_VARIATION[i] <- DEMAND_VARIATION_VALUE
  } else {
    full_data_customer$DEMAND_VARIATION[i] <- NA
  }
}

# Create the HIGH_GROW_POT column
full_data_customer$HIGH_GROW_POT <- 0  # Initialize all values to 0

# Calculate the mean of DEMAND_VARIATION for positive values only
positive_variations <- full_data_customer$DEMAND_VARIATION[full_data_customer$DEMAND_VARIATION > 0]

if (length(positive_variations) > 0) {
  mean_value <- mean(positive_variations, na.rm = TRUE)

  # Display the calculated mean
  cat("Calculated mean of positive DEMAND_VARIATION: ", mean_value, "\n")

  # Assign 1 for customers with DEMAND_VARIATION greater than the mean
  full_data_customer$HIGH_GROW_POT <- ifelse(!is.na(full_data_customer$DEMAND_VARIATION) & full_data_customer$DEMAND_VARIATION > mean_value,
                                             1, full_data_customer$HIGH_GROW_POT)
} else {
  cat("No positive DEMAND_VARIATION values found\n")
}
Calculated mean of positive DEMAND_VARIATION: 0.2843618
Across all customers, the average positive demand variation was 28%. However, 6,026 customers were excluded from the analysis because their growth could not be calculated: they had fewer than 6 periods with orders. These customers were assumed to have no growth potential.
Below is the share of customers whose growth exceeded the average, regardless of segment.
Code
# Group and calculate the percentage of customers with HIGH_GROW_POT = 1 and 0 by LFO
summary_high_growth <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    high_growth = sum(HIGH_GROW_POT == 1, na.rm = TRUE),
    low_growth = sum(HIGH_GROW_POT == 0, na.rm = TRUE),
    total_customers = n(),
    .groups = "drop"
  ) %>%
  mutate(
    pct_high_growth = high_growth / total_customers * 100,
    pct_low_growth = low_growth / total_customers * 100
  )

# Transform data into long format for percentages
summary_high_growth_long <- summary_high_growth %>%
  pivot_longer(
    cols = starts_with("pct_"),
    names_to = "growth_type",
    values_to = "percentage"
  ) %>%
  mutate(
    growth_type = factor(growth_type,
                         levels = c("pct_low_growth", "pct_high_growth"),
                         labels = c("Low Growth Potential", "High Growth Potential"))
  )

# Ensure LFO is a factor
summary_high_growth_long$LOCAL_FOUNT_ONLY <- factor(summary_high_growth_long$LOCAL_FOUNT_ONLY, levels = c("0", "1"))

# Plot for percentages with the legend on the side
ggplot(summary_high_growth_long, aes(x = LOCAL_FOUNT_ONLY, y = percentage, fill = growth_type)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(percentage, suffix = "%")),
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  labs(title = "Percentage of Customers Classified as Low or High Growth Potential") +
  scale_fill_manual(values = c("Low Growth Potential" = "#FF6347", "High Growth Potential" = "#40E0D0")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.title = element_blank(),   # Remove legend title
    legend.position = "right",        # Position legend on the right side
    legend.box = "vertical",          # Ensure vertical arrangement for the legend
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 10, face = "bold"),
    strip.background = element_blank(),
    axis.text.x = element_text(size = 10),
    panel.spacing = unit(1, "lines"),
    strip.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  guides(fill = guide_legend(title = "Growth Potential"))  # Add a legend title
Code
# Group and calculate the number of customers with HIGH_GROW_POT = 1 and 0 by LFO
summary_high_growth <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    low_growth = sum(HIGH_GROW_POT == 0, na.rm = TRUE),
    high_growth = sum(HIGH_GROW_POT == 1, na.rm = TRUE),
    .groups = "drop"
  )

# Display the summary with the count of customers
#summary_high_growth
Approximately 9% of customers (123) identified as local market partners who purchase fountain-only products show high growth potential according to the established criteria. For other customers, about 12% (3450) are classified as having high growth potential.
Customers with high volumes are somewhat penalized by this criterion, as significant demand growth is more difficult to achieve. However, their substantial volume already places them as strategic partners, making them essential for close monitoring and prioritized deliveries via red trucks. For these customers, lower distribution costs allow for more competitive pricing, supporting the long-term sustainability of the partnership.
5.3.3 Demand Variation by Cold Drink Channel
Each customer’s growth potential was considered within their respective segment. Following the same criteria as before, only customers whose demand variation exceeded the segment average were classified as high potential. Below is the calculated demand variation for each Cold Drink Channel during the period.
Code
# Define the custom color palette for COLD_DRINK_CHANNEL with unique colors
cold_drink_channel_colors <- c(
  "DINING" = "#A7ADC6", "PUBLIC SECTOR" = "#FF6347", "EVENT" = "#B33951",
  "WORKPLACE" = "#ABD2FA", "ACCOMMODATION" = "#E377C2", "GOODS" = "#FFD700",
  "BULK TRADE" = "#8ED081", "WELLNESS" = "#20B2AA", "CONVENTIONAL" = "#1F77B4"
)

# Aggregate data: mean DEMAND_VARIATION by channel
summary_growth_channel <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(CHANNEL_VAR = mean(DEMAND_VARIATION, na.rm = TRUE))

# Create horizontal bar chart
ggplot(summary_growth_channel, aes(x = CHANNEL_VAR, y = reorder(COLD_DRINK_CHANNEL, CHANNEL_VAR), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", alpha = 0.6) +
  geom_text(aes(label = paste0(round(CHANNEL_VAR * 100, 1), "%")), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Average Demand Variation by Cold Drink Channel",
       x = "Percentage Variation (%)", y = NULL) +
  scale_x_continuous(labels = scales::label_percent(accuracy = 0.1), expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(size = 10),
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray", size = 0.5)
  )
Dining and bulk trade are the most important channels, with customers increasing their demand by 2.1% and 5.6%, respectively, on average.
Wellness experienced the highest variation at almost 10%, but it accounts for only 3.2% of the total volume sold. Goods had the second-highest variation, at 9%, and represents 10% of the total volume.
Code
# Calculate the mean DEMAND_VARIATION for each COLD_DRINK_CHANNEL
channel_means <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(MEAN_DEMAND_VARIATION = mean(DEMAND_VARIATION, na.rm = TRUE))

# Merge the mean values with the full_data_customer
full_data_customer <- full_data_customer %>%
  left_join(channel_means, by = "COLD_DRINK_CHANNEL")

# Create the CHANNEL_GROWTH_POT column
full_data_customer$CHANNEL_GROWTH_POT <- ifelse(
  is.na(full_data_customer$DEMAND_VARIATION), 0,
  ifelse(full_data_customer$DEMAND_VARIATION > full_data_customer$MEAN_DEMAND_VARIATION, 1, 0)
)

# Remove the MEAN_DEMAND_VARIATION column
full_data_customer <- full_data_customer %>% dplyr::select(-MEAN_DEMAND_VARIATION)

# Calculate the percentage of customers with high growth potential by channel
summary_growth_channel_customers <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(pct_high_growth = mean(CHANNEL_GROWTH_POT == 1, na.rm = TRUE) * 100)

# Create the horizontal bar chart
ggplot(summary_growth_channel_customers, aes(x = pct_high_growth, y = reorder(COLD_DRINK_CHANNEL, pct_high_growth), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", alpha = 0.6) +
  geom_text(aes(label = paste0(round(pct_high_growth, 1), "%")), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Percentage of Customers with High Growth Potential by Cold Drink Channel",
       x = "Percentage of Customers (%)", y = NULL) +
  scale_x_continuous(labels = scales::label_number(accuracy = 1), expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) +  # Now correctly using the defined palette
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(size = 10),
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray", size = 0.5)
  )
In most segments, more than 30% of stores grew above their group's average; only the EVENT segment presented a lower share, close to 23%. These customers will be classified as high-growth within their respective segments.
Classifying every customer whose variation exceeds their channel's average significantly expands the high-potential customer base. Even in a simulation requiring growth of 100% above the average, the base remained large. This criterion will therefore need further analysis before it can be adopted.
6. Correlations
Customer Features X RFM_SCORE
To understand how the variables relate to one another, and to obtain clear information without multicollinearity, we selected the numeric variables and display only the most significant correlations (disregarding values between -0.2 and 0.2).
Code
# List of selected variables
selected_vars <- c("LOCAL_FOUNT_ONLY", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "CHAIN_MEMBER",
                   "DAYS_ONBOARDING", "DAYS_FIRST_DLV", "OT_CALL.CENTER", "OT_OTHER",
                   "OT_SALES.REP", "OT_MYCOKE.LEGACY", "OT_MYCOKE360", "OT_EDI",
                   "RFM_SCORE", "HIGH_GROW_POT", "CHANNEL_GROWTH_POT", "LOW_DEMAND_CUST",
                   "TOTAL_COST_CA_GAL")

# Select only the numeric variables from the dataset
numeric_vars <- full_data_customer %>%
  dplyr::select(all_of(selected_vars)) %>%
  dplyr::select(where(is.numeric))

# Compute the correlation matrix (handling missing values)
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

# Replace NAs in correlation matrix with 0 to avoid errors
cor_matrix[is.na(cor_matrix)] <- 0

# Remove variables with a perfect correlation of 1
cor_matrix[cor_matrix == 1] <- NA  # Set correlations of 1 to NA to exclude them

# Convert correlation matrix to long format
cor_df <- as.data.frame(cor_matrix) %>%
  rownames_to_column(var = "Variable1") %>%
  pivot_longer(cols = -Variable1, names_to = "Variable2", values_to = "Correlation") %>%
  filter(!is.na(Correlation)) %>%  # Remove NAs which represent correlations of 1
  filter((Correlation >= 0.20 & Correlation <= 0.99) | (Correlation <= -0.20 & Correlation >= -0.99)) %>%  # Keep only correlations outside of the -0.20 to 0.20 range
  mutate(Correlation = round(Correlation, 2)) %>%  # Round correlations to 2 decimal places
  mutate(pair_id = paste0(pmin(Variable1, Variable2), "-", pmax(Variable1, Variable2))) %>%
  distinct(pair_id, .keep_all = TRUE) %>%  # Remove duplicate pairs (A-B, B-A)
  dplyr::select(Variable1, Variable2, Correlation) %>%
  arrange(desc(Correlation))  # Sort by correlation value from highest to lowest

# Display the correlation matrix in kable format with styling
cor_df %>%
  kable("html", col.names = c("Variable 1", "Variable 2", "Correlation")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:3, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "lightgray") %>%
  add_header_above(c("Correlations Between Selected Variables (over -+0.2)" = 3)) %>%
  kable_paper("striped", full_width = FALSE)
Code
# Compute the correlation matrix for the selected numeric variables
correlation_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

# Replace NAs with 0 to avoid errors
correlation_matrix[is.na(correlation_matrix)] <- 0

# Visualize the correlation matrix with rotated labels
corrplot(correlation_matrix, method = "circle", type = "upper",
         tl.cex = 0.8, tl.col = "black", tl.srt = 45, number.cex = 0.6,
         diag = FALSE,  # Remove diagonal
         col = colorRampPalette(c("blue", "white", "red"))(200))  # Color palette
The strongest correlations were observed between Days Onboarding and Days After First Delivery (0.7) and between the order types MyCoke Legacy and MyCoke 360 (0.66). Both relationships make sense: customers who onboarded earlier tend to have older orders, except for cases where a new store belongs to an established chain. Similarly, customers who previously used the legacy channel transitioned to the newer 360 platform.
There is a correlation of 0.53 between overall customer growth and growth within the Cold Drink Channel, suggesting that expansion trends align across segments. The RFM Score also correlates with several variables that were not directly used in its calculation, with correlations ranging from 0.27 to 0.44.
Among the negative correlations, the most notable is between RFM Score and Low Demand Customer (-0.65), indicating that lower RFM scores effectively capture low-demand customers.
Census X Total Ordered
All correlations between the 2023 updated census variables and the customers' consumption patterns were very low, close to zero.
For this reason, these variables will be excluded, along with others no longer required, to streamline full_data_customer. However, the process will be retained in this document, as the company may obtain different results when applying real locations.
7. Customer Segmentation
Since all customers in the original dataset are served by red trucks, there is no prior information on the characteristics of those who would be served by white trucks. The only available reference is the average annual consumption threshold of 400 gallons or cases.
To address this, customers were segmented based on their most relevant characteristics within the available scope, including variables created during the analysis.
The selected variables represent store-level traits or consumption behavior; geographic and census data were excluded.
The variables selected are listed below:
Customer Type & Relationship:
These variables represent customers’ relationship with the company and their type:
- LOCAL_FOUNT_ONLY: Customers who only consume fountain drinks.
- LOCAL_MARKET_PARTNER: Local market partners.
- CO2_CUSTOMER: Customers who are CO2 consumers.
- CHAIN_MEMBER: Customers who are part of a chain.
Time-Related Metrics:
Time-related metrics track customers’ activity and engagement over time:
- DAYS_ONBOARDING: Number of days since onboarding.
- DAYS_FIRST_DLV: Number of days since the first delivery.
- DAYS_AF_LAST_ORD: Number of days after the last order.
- AVG_DAYS_BET_ORD: Average number of days between orders.
Order & Sales Behavior:
These variables represent customer behaviors in terms of orders and sales:
- NUM_ORDERS: Total number of orders.
- TOTAL_ORDERED: Total volume of orders.
- RFM_SCORE: Adapted Recency, Frequency, Monetary score.
- TOTAL_COST_CA_GAL: Total cost in deliveries for 2023 and 2024.
Order Channels:
This category contains data on the various channels through which customers make their transactions:
- OT_CALL.CENTER: Transactions via call center.
- OT_OTHER: Transactions made through other means (emails, trade fairs, etc.).
- OT_SALES.REP: Transactions via sales representatives.
- OT_MYCOKE.LEGACY: Transactions via MyCoke (legacy platform).
- OT_MYCOKE360: Transactions via the MyCoke 360 platform.
- OT_EDI: Transactions via electronic direct ordering (EDI).
Growth & Demand Potential:
These flags indicate customers’ growth and demand potential:
- HIGH_GROW_POT: Flag for customers with above-average growth potential across all segments.
- CHANNEL_GROWTH_POT: Flag for customers with above-average growth within their segment.
- LOW_DEMAND_CUST: Flag for customers with low demand (below the 1st quartile) by segment.
Several variables have a wide range of values with extreme outliers. For these variables—DAYS_ONBOARDING, DAYS_FIRST_DLV, TOTAL_ORDERED, and TOTAL_COST_CA_GAL—we will apply a logarithmic transformation.
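A quick illustration of why the log transformation helps before standardizing, using toy values: log1p() compresses a heavy right tail, so a 1000x spread collapses to about a 6-point spread on the log scale, which keeps k-means distances from being dominated by a few extreme customers.

```r
# Toy order totals spanning three orders of magnitude
orders <- c(1, 10, 100, 1000)

# log1p(x) = log(1 + x); safe at zero and compresses large values
log1p(orders)                          # 0.693 2.398 4.615 6.909

diff(range(orders))                    # raw spread: 999
round(diff(range(log1p(orders))), 3)   # log-scale spread: 6.216
```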
Code
# Select the primary variables for clustering based on business relevance
selected_vars <- c("LOCAL_FOUNT_ONLY", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "CHAIN_MEMBER",
                   "DAYS_ONBOARDING", "DAYS_FIRST_DLV", "DAYS_AF_LAST_ORD", "AVG_DAYS_BET_ORD",
                   "OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP", "OT_MYCOKE.LEGACY",
                   "OT_MYCOKE360", "OT_EDI", "NUM_ORDERS", "TOTAL_ORDERED", "RFM_SCORE",
                   "HIGH_GROW_POT", "CHANNEL_GROWTH_POT", "LOW_DEMAND_CUST", "TOTAL_COST_CA_GAL")

# Extract the numeric variables selected for clustering
data_to_cluster <- full_data_customer %>%
  dplyr::select(all_of(selected_vars)) %>%
  dplyr::select(where(is.numeric))

# Apply log transformation to the skewed variables
data_to_cluster$DAYS_ONBOARDING <- log1p(data_to_cluster$DAYS_ONBOARDING)
data_to_cluster$DAYS_FIRST_DLV <- log1p(data_to_cluster$DAYS_FIRST_DLV)
data_to_cluster$TOTAL_ORDERED <- log1p(data_to_cluster$TOTAL_ORDERED)
data_to_cluster$TOTAL_COST_CA_GAL <- log1p(data_to_cluster$TOTAL_COST_CA_GAL)

# Standardize the numeric variables for clustering
data_to_cluster <- scale(data_to_cluster)

# Determine optimal number of clusters using the Elbow Method
set.seed(500)
wss <- sapply(1:10, function(k) kmeans(data_to_cluster, centers = k, nstart = 25)$tot.withinss)

# Visualize the Elbow Method results
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters", ylab = "Total Within Sum of Squares (WSS)",
     main = "Elbow Method for Optimal K")
After testing different compositions to calculate the silhouette score and ARI score—varying the number of clusters from 2 to 4, using multiple distance metrics (Euclidean, Manhattan), and applying different algorithms (Hartigan-Wong, MacQueen, Lloyd)—the most relevant metrics are presented below.
Code
# Set seed for reproducibility
set.seed(500)

# Function to calculate the Silhouette Score
calculate_silhouette_score <- function(model, data) {
  clusters <- model$cluster
  if (length(clusters) != nrow(data)) {
    stop("Cluster assignments do not match the number of data points.")
  }
  dist_matrix <- dist(data)
  silhouette_score <- silhouette(clusters, dist_matrix)
  return(mean(silhouette_score[, 3]))
}

# Function to calculate Adjusted Rand Index
calculate_ari <- function(model, true_labels) {
  clusters <- model$cluster
  ari_score <- adjustedRandIndex(clusters, true_labels)
  return(ari_score)
}

# Select the primary variables for clustering based on business relevance
selected_vars <- c("LOCAL_FOUNT_ONLY", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "CHAIN_MEMBER",
                   "DAYS_ONBOARDING", "DAYS_FIRST_DLV", "DAYS_AF_LAST_ORD", "AVG_DAYS_BET_ORD",
                   "OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP", "OT_MYCOKE.LEGACY",
                   "OT_MYCOKE360", "OT_EDI", "NUM_ORDERS", "TOTAL_ORDERED", "RFM_SCORE",
                   "HIGH_GROW_POT", "CHANNEL_GROWTH_POT", "LOW_DEMAND_CUST", "TOTAL_COST_CA_GAL")

# Extract the numeric variables selected for clustering
data_to_cluster <- full_data_customer %>%
  dplyr::select(all_of(selected_vars)) %>%
  dplyr::select(where(is.numeric))

# Apply log transformation to the skewed variables
data_to_cluster$DAYS_ONBOARDING <- log1p(data_to_cluster$DAYS_ONBOARDING)
data_to_cluster$DAYS_FIRST_DLV <- log1p(data_to_cluster$DAYS_FIRST_DLV)
data_to_cluster$TOTAL_ORDERED <- log1p(data_to_cluster$TOTAL_ORDERED)
data_to_cluster$TOTAL_COST_CA_GAL <- log1p(data_to_cluster$TOTAL_COST_CA_GAL)

# Standardize the numeric variables for clustering
data_to_cluster <- scale(data_to_cluster)

# Define different parameter configurations for K-means
params <- list(
  list(name = "Euclidean, 2 Clusters", centers = 2, nstart = 25, algorithm = "Hartigan-Wong"),
  list(name = "Euclidean, 3 Clusters", centers = 3, nstart = 25, algorithm = "Hartigan-Wong"),
  list(name = "Euclidean, 4 Clusters", centers = 4, nstart = 25, algorithm = "Hartigan-Wong")
  # ,
  # list(name = "Manhattan, 3 Clusters", centers = 2, nstart = 25, algorithm = "MacQueen"),
  # list(name = "Manhattan, 2 Clusters", centers = 3, nstart = 25, algorithm = "MacQueen"),
  # list(name = "Manhattan, 4 Clusters", centers = 4, nstart = 25, algorithm = "MacQueen"),
  # list(name = "K-means++, 3 Clusters", centers = 2, nstart = 25, algorithm = "Lloyd"),
  # list(name = "K-means++, 2 Clusters", centers = 3, nstart = 25, algorithm = "Lloyd"),
  # list(name = "K-means++, 4 Clusters", centers = 4, nstart = 25, algorithm = "Lloyd")
)

# Apply K-means clustering and store results
results <- lapply(params, function(param) {
  model <- kmeans(data_to_cluster, centers = param$centers, nstart = param$nstart, algorithm = param$algorithm)
  silhouette <- calculate_silhouette_score(model, data_to_cluster)
  ari <- calculate_ari(model, full_data_customer$LOCAL_MARKET_PARTNER)  # You can replace with a true label column if needed
  return(data.frame(Model = param$name, Silhouette_Score = round(silhouette, 3), ARI = round(ari, 3)))
})

# Combine results into a single table
results_df <- do.call(rbind, results)

# Display table using kable
kable(results_df, col.names = c("Parameter", "Silhouette Score", "Adjusted Rand Index (ARI)"))
| Parameter | Silhouette Score | Adjusted Rand Index (ARI) |
|---|---|---|
| Euclidean, 2 Clusters | 0.210 | -0.054 |
| Euclidean, 3 Clusters | 0.180 | 0.043 |
| Euclidean, 4 Clusters | 0.176 | 0.048 |
Given the results, the combination "Euclidean, 3 Clusters" was selected, using centers = 3, nstart = 25, and the "Hartigan-Wong" algorithm (the default), as it offered the best balance between silhouette score and ARI among the tested options. Still, the separation between clusters remains marginal and relatively weak.
Below is the visualization of the clusters based on the two principal components.
Code
# Implement K-means with optimal number of clusters
set.seed(500)
kmeans_optimal <- kmeans(data_to_cluster, centers = 3, nstart = 25, algorithm = "Hartigan-Wong")

# Add cluster assignments to the original dataset
full_data_customer$CLUSTER <- kmeans_optimal$cluster

# Define custom colors for the clusters
palette_clusters <- c("1" = "#FF6347",  # Coral
                      "2" = "#4682B4",  # Cornflower blue
                      "3" = "#FFD700")  # Yellow

# Visualize cluster distribution with PCA-reduced dimensions
fviz_cluster(kmeans_optimal, data = data_to_cluster,
             geom = "point", ellipse.type = "none",
             main = "Customer Segmentation: PCA-based Visualization",
             subtitle = "K-means Optimal Clustering with 3 Segments",
             ggtheme = theme_minimal()) +
  scale_color_manual(values = palette_clusters)  # Manually set colors
The customer segmentation will be discussed later, including the interpretation of each cluster.
7.1 Clusters and principal components
Given the visualization of the clusters through their principal components, the decision was made to further explore the characteristics of the two main components, as they account for 39% of the total variability.
Code
# Select the desired variables - adjusted to match the clustering selection
selected_vars <- c("LOCAL_FOUNT_ONLY", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "CHAIN_MEMBER",
                   "DAYS_ONBOARDING", "DAYS_FIRST_DLV", "DAYS_AF_LAST_ORD", "AVG_DAYS_BET_ORD",
                   "OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP", "OT_MYCOKE.LEGACY",
                   "OT_MYCOKE360", "OT_EDI", "NUM_ORDERS", "TOTAL_ORDERED", "RFM_SCORE",
                   "HIGH_GROW_POT", "CHANNEL_GROWTH_POT", "LOW_DEMAND_CUST", "TOTAL_COST_CA_GAL")

# Select only the desired variables from full_data_customer - works for both data.frame and data.table
customer_data <- full_data_customer[, selected_vars]

# Remove rows with NA values (if any)
customer_data <- na.omit(customer_data)

# Apply log transformation to the same variables as in clustering
customer_data$DAYS_ONBOARDING <- log1p(customer_data$DAYS_ONBOARDING)
customer_data$DAYS_FIRST_DLV <- log1p(customer_data$DAYS_FIRST_DLV)
customer_data$TOTAL_ORDERED <- log1p(customer_data$TOTAL_ORDERED)
customer_data$TOTAL_COST_CA_GAL <- log1p(customer_data$TOTAL_COST_CA_GAL)

# Build scales for the dataset
scales <- build_scales(customer_data, verbose = FALSE)

# Scaling columns
customer_data <- fast_scale(customer_data, scales = scales, verbose = FALSE)

# Calculating the covariance matrix
cov_customer <- cov(customer_data)

# Calculating the Eigenvectors and Eigenvalues of the variance-covariance matrix
e_customer <- eigen(cov_customer)
eigenvalues_customer <- e_customer$values
eigenvectors_customer <- e_customer$vectors

# Print
#print(paste("Counts the number of eigenvalues:", length(eigenvalues_customer)))

# Initialize an empty matrix to store the contributions of variables to all PCs
contributions_matrix <- matrix(NA, nrow = ncol(customer_data), ncol = ncol(eigenvectors_customer))

# Loop through all principal components
for (i in 1:ncol(eigenvectors_customer)) {
  # Get the contributions of variables to the i-th principal component (PC)
  pc_contributions <- eigenvectors_customer[, i]

  # Assign the contributions to the corresponding column in the matrix
  contributions_matrix[, i] <- round(pc_contributions, 2)  # Round to 2 decimal places
}

# Convert the matrix to a data frame and assign appropriate row and column names
contributions_df <- as.data.frame(contributions_matrix)
colnames(contributions_df) <- paste0("PC", 1:ncol(contributions_matrix))  # Name the columns dynamically
rownames(contributions_df) <- colnames(customer_data)  # Assign the variable names as row names

# Variance Explained
# Calculate the variance explained by each principal component
variance_explained <- eigenvalues_customer / sum(eigenvalues_customer)

# Round the variance explained to 2 decimal places
variance_row <- round(variance_explained, 2)

# Calculate the cumulative variance explained
cumulative_variance <- cumsum(variance_explained)

# Round the cumulative variance to 2 decimal places
cumulative_variance_row <- round(cumulative_variance, 2)

# Add the variance and cumulative variance rows to the bottom of the data frame
contributions_df <- rbind(contributions_df,
                          Variance_Explained = variance_row,
                          Cumulative_Variance = cumulative_variance_row)

# Format the table using formattable for heatmap effect
formattable(contributions_df, list(
  # Apply color gradient to all columns
  area(col = 1:ncol(contributions_df)) ~ color_tile("white", "deepskyblue3")
))
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 | PC12 | PC13 | PC14 | PC15 | PC16 | PC17 | PC18 | PC19 | PC20 | PC21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LOCAL_FOUNT_ONLY | -0.05 | 0.00 | 0.00 | 0.03 | -0.18 | 0.29 | 0.73 | -0.24 | 0.00 | 0.40 | -0.14 | -0.16 | -0.28 | 0.01 | -0.01 | 0.01 | 0.04 | -0.04 | 0.03 | 0.04 | 0.00 |
| LOCAL_MARKET_PARTNER | -0.12 | 0.02 | 0.22 | -0.14 | -0.13 | 0.35 | 0.18 | 0.41 | 0.29 | -0.52 | 0.24 | 0.13 | -0.38 | -0.03 | 0.02 | -0.06 | 0.02 | 0.04 | 0.01 | -0.02 | 0.00 |
| CO2_CUSTOMER | 0.00 | -0.02 | 0.24 | -0.36 | 0.52 | -0.10 | -0.25 | 0.08 | 0.10 | 0.37 | -0.04 | -0.14 | -0.54 | -0.02 | -0.04 | 0.03 | 0.06 | -0.05 | 0.05 | 0.05 | 0.00 |
| CHAIN_MEMBER | 0.14 | 0.00 | -0.30 | 0.30 | -0.36 | -0.18 | -0.25 | -0.11 | -0.23 | -0.17 | -0.07 | -0.20 | -0.65 | -0.10 | -0.01 | -0.12 | 0.00 | 0.01 | -0.01 | -0.03 | 0.00 |
| DAYS_ONBOARDING | 0.20 | -0.43 | -0.17 | 0.25 | 0.26 | 0.00 | 0.11 | 0.10 | 0.19 | -0.04 | 0.12 | -0.24 | 0.03 | 0.03 | 0.07 | 0.00 | -0.02 | 0.42 | 0.54 | 0.00 | 0.00 |
| DAYS_FIRST_DLV | 0.22 | -0.43 | -0.14 | 0.26 | 0.26 | 0.01 | 0.11 | 0.10 | 0.21 | -0.07 | 0.06 | -0.14 | 0.00 | 0.02 | -0.01 | 0.01 | 0.01 | -0.39 | -0.61 | 0.03 | 0.00 |
| DAYS_AF_LAST_ORD | -0.22 | -0.10 | -0.23 | 0.22 | 0.19 | -0.17 | 0.11 | 0.04 | 0.06 | 0.13 | -0.11 | 0.78 | -0.18 | -0.04 | -0.05 | -0.22 | 0.08 | 0.11 | 0.00 | 0.14 | 0.00 |
| AVG_DAYS_BET_ORD | -0.32 | 0.07 | -0.12 | 0.04 | 0.02 | -0.12 | 0.02 | 0.00 | -0.03 | 0.08 | 0.29 | -0.06 | -0.07 | -0.03 | 0.74 | 0.29 | -0.21 | 0.10 | -0.15 | 0.21 | 0.00 |
| OT_CALL.CENTER | 0.12 | -0.34 | 0.00 | -0.11 | -0.04 | 0.48 | -0.22 | -0.20 | -0.37 | 0.15 | 0.39 | 0.24 | -0.03 | 0.02 | 0.06 | -0.09 | -0.09 | -0.10 | 0.06 | -0.07 | -0.37 |
| OT_OTHER | 0.04 | -0.01 | -0.05 | -0.06 | 0.05 | -0.19 | 0.27 | 0.67 | -0.62 | 0.09 | 0.03 | -0.08 | 0.04 | -0.04 | -0.07 | -0.03 | -0.05 | -0.03 | 0.01 | 0.01 | -0.09 |
| OT_SALES.REP | 0.14 | -0.03 | -0.10 | -0.42 | -0.01 | -0.52 | 0.31 | -0.26 | 0.12 | -0.24 | 0.15 | 0.02 | -0.03 | 0.00 | 0.02 | -0.14 | -0.11 | -0.10 | 0.06 | -0.05 | -0.46 |
| OT_MYCOKE.LEGACY | 0.19 | 0.42 | 0.03 | 0.39 | 0.20 | 0.06 | 0.04 | 0.02 | 0.07 | -0.01 | 0.08 | 0.12 | -0.09 | -0.04 | -0.22 | 0.53 | -0.21 | -0.06 | 0.07 | -0.06 | -0.41 |
| OT_MYCOKE360 | 0.16 | 0.44 | 0.16 | 0.30 | 0.21 | 0.05 | 0.02 | 0.01 | 0.04 | 0.08 | 0.18 | -0.12 | 0.06 | 0.05 | 0.26 | -0.68 | 0.06 | -0.06 | 0.01 | -0.02 | -0.10 |
| OT_EDI | 0.07 | 0.00 | -0.19 | -0.05 | -0.47 | -0.05 | -0.19 | 0.39 | 0.48 | 0.49 | 0.08 | 0.00 | 0.07 | -0.03 | -0.03 | -0.07 | -0.09 | -0.09 | 0.04 | 0.00 | -0.22 |
| NUM_ORDERS | 0.34 | 0.12 | -0.10 | -0.09 | -0.01 | -0.10 | 0.10 | -0.06 | 0.01 | 0.10 | 0.44 | 0.20 | -0.06 | -0.02 | -0.08 | 0.04 | -0.29 | -0.20 | 0.13 | -0.12 | 0.65 |
| TOTAL_ORDERED | 0.37 | 0.07 | -0.03 | -0.11 | -0.02 | -0.01 | 0.05 | 0.05 | 0.00 | 0.09 | -0.05 | 0.16 | -0.04 | 0.04 | 0.28 | 0.15 | 0.36 | 0.39 | -0.28 | -0.59 | 0.00 |
| RFM_SCORE | 0.37 | 0.11 | 0.02 | -0.14 | -0.07 | 0.04 | 0.00 | -0.05 | -0.02 | 0.02 | 0.13 | 0.01 | 0.01 | 0.01 | -0.19 | -0.02 | 0.02 | 0.49 | -0.32 | 0.65 | 0.00 |
| HIGH_GROW_POT | 0.02 | -0.20 | 0.53 | 0.24 | -0.22 | -0.29 | 0.01 | 0.00 | -0.04 | 0.07 | 0.08 | 0.09 | -0.09 | 0.67 | 0.00 | 0.06 | -0.04 | 0.01 | 0.00 | 0.01 | 0.00 |
| CHANNEL_GROWTH_POT | 0.09 | -0.21 | 0.55 | 0.19 | -0.15 | -0.20 | 0.03 | -0.05 | 0.00 | 0.07 | 0.03 | 0.07 | 0.04 | -0.72 | 0.07 | 0.01 | -0.01 | 0.02 | 0.00 | 0.00 | 0.00 |
| LOW_DEMAND_CUST | -0.30 | 0.05 | -0.08 | 0.10 | -0.02 | -0.15 | 0.01 | -0.07 | -0.01 | 0.07 | 0.58 | -0.13 | 0.03 | -0.05 | -0.28 | 0.10 | 0.64 | -0.02 | 0.00 | -0.01 | 0.00 |
| TOTAL_COST_CA_GAL | 0.36 | 0.03 | -0.02 | -0.04 | -0.06 | 0.03 | 0.00 | 0.06 | -0.02 | -0.08 | -0.17 | 0.15 | 0.02 | 0.04 | 0.33 | 0.19 | 0.49 | -0.42 | 0.33 | 0.36 | 0.01 |
| Variance_Explained | 0.30 | 0.09 | 0.07 | 0.07 | 0.06 | 0.06 | 0.05 | 0.05 | 0.05 | 0.04 | 0.03 | 0.03 | 0.02 | 0.02 | 0.02 | 0.02 | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 |
| Cumulative_Variance | 0.30 | 0.39 | 0.46 | 0.54 | 0.60 | 0.66 | 0.71 | 0.76 | 0.80 | 0.84 | 0.87 | 0.90 | 0.93 | 0.95 | 0.96 | 0.98 | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 |
Principal Component 1, which explains 30% of the variance, is weighted most heavily by RFM_SCORE, NUM_ORDERS, TOTAL_ORDERED, and TOTAL_COST_CA_GAL.
Principal Component 2 adds another 9% of the variance, with the highest weights on the OT_MYCOKE order-type variables.
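A common follow-up to this table is deciding how many components to retain. The short sketch below reuses `eigenvalues_customer` from the chunk above (assumed to still be in scope) to find the smallest number of components reaching typical cumulative-variance targets:

```r
# Cumulative variance explained, from the eigenvalues computed above
variance_explained <- eigenvalues_customer / sum(eigenvalues_customer)
cum_var <- cumsum(variance_explained)

# Smallest number of components whose cumulative variance reaches each target
sapply(c(0.70, 0.80, 0.90), function(target) which(cum_var >= target)[1])
```

From the cumulative-variance row of the table, roughly 7, 9, and 12 components correspond to the 70%, 80%, and 90% levels.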
7.2 Clusters Features
The clusters will be characterized based on their relationships with other variables.
Code
# Define specific colors for fleet types
fleet_colors <- c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")

# Create a cross-tabulation of CLUSTER and FLEET_TYPE
cluster_fleet_table <- table(full_data_customer$CLUSTER, full_data_customer$FLEET_TYPE, useNA = "ifany")

# Create data frame for visualization
cluster_fleet_df <- as.data.frame.table(cluster_fleet_table)
names(cluster_fleet_df) <- c("Segment", "FleetType", "Count")

# Filter out NA values for cleaner visualization
cluster_fleet_df <- cluster_fleet_df %>%
  filter(!is.na(Segment) & !is.na(FleetType))

# Calculate proportions within each cluster
cluster_fleet_df$Pct <- cluster_fleet_df$Count /
  ave(cluster_fleet_df$Count, cluster_fleet_df$Segment, FUN = sum)

# Create percentage distribution plot for fleet types within clusters
ggplot(cluster_fleet_df, aes(x = Segment, y = Pct, fill = FleetType)) +
  geom_bar(stat = "identity", position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = fleet_colors) +
  geom_text(aes(label = scales::percent(Pct, accuracy = 0.1)),
            position = position_fill(vjust = 0.5), color = "black", size = 3.5) +
  labs(title = "Fleet Type Distribution Across Clusters",
       subtitle = "Fleet type classification using a 400-gallon threshold",
       x = "Cluster", y = "Percentage", fill = "Fleet Type") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12, color = "gray30"),
        legend.position = "right",
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())
Code
# Define specific colors for the growth-potential variable
high_growth_colors <- c("1" = "#FF6347", "0" = "#D3D3D3")  # High Growth vs Low Growth

# Create data frame for HIGH_GROW_POT visualization
cluster_high_growth_df <- as.data.frame.table(
  table(full_data_customer$CLUSTER, full_data_customer$HIGH_GROW_POT))
names(cluster_high_growth_df) <- c("Segment", "HighGrowth", "Count")

# Filter out NA values for cleaner visualization
cluster_high_growth_df <- cluster_high_growth_df %>%
  filter(!is.na(Segment) & !is.na(HighGrowth))

# Calculate proportions for HIGH_GROW_POT
cluster_high_growth_df$Pct <- cluster_high_growth_df$Count /
  ave(cluster_high_growth_df$Count, cluster_high_growth_df$Segment, FUN = sum)

# Plot the HIGH_GROW_POT distribution by cluster
ggplot(cluster_high_growth_df, aes(x = Segment, y = Pct, fill = HighGrowth)) +
  geom_bar(stat = "identity", position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = high_growth_colors) +
  geom_text(aes(label = scales::percent(Pct, accuracy = 0.1)),
            position = position_fill(vjust = 0.5), color = "black", size = 3.5) +
  labs(title = "Clusters by Growth Potential",
       subtitle = "Proportional Representation by High Growth Potential",
       x = "Cluster", y = "Percentage", fill = "Growth Potential") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12, color = "gray30"),
        legend.position = "right",
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())
Code
# Define specific colors for the fountain-only variable
fountain_only_colors <- c("1" = "#4682B4", "0" = "#D3D3D3")  # Fountain Only vs Not Fountain Only

# Create data frame for LOCAL_FOUNT_ONLY visualization
cluster_fountain_df <- as.data.frame.table(
  table(full_data_customer$CLUSTER, full_data_customer$LOCAL_FOUNT_ONLY))
names(cluster_fountain_df) <- c("Segment", "FountainOnly", "Count")

# Filter out NA values for cleaner visualization
cluster_fountain_df <- cluster_fountain_df %>%
  filter(!is.na(Segment) & !is.na(FountainOnly))

# Calculate proportions for LOCAL_FOUNT_ONLY
cluster_fountain_df$Pct <- cluster_fountain_df$Count /
  ave(cluster_fountain_df$Count, cluster_fountain_df$Segment, FUN = sum)

# Plot the LOCAL_FOUNT_ONLY distribution by cluster
ggplot(cluster_fountain_df, aes(x = Segment, y = Pct, fill = FountainOnly)) +
  geom_bar(stat = "identity", position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = fountain_only_colors) +
  geom_text(aes(label = scales::percent(Pct, accuracy = 0.1)),
            position = position_fill(vjust = 0.5), color = "black", size = 3.5) +
  labs(title = "Clusters by Fountain Only",
       subtitle = "Proportional Representation by Fountain Only",
       x = "Cluster", y = "Percentage", fill = "Fountain Only") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12, color = "gray30"),
        legend.position = "right",
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())
Code
# Define specific colors for the LOW_DEMAND_CUST variable
low_demand_colors <- c("1" = "yellow", "0" = "#D3D3D3")  # Low Demand vs Not Low Demand

# Create data frame for LOW_DEMAND_CUST visualization
cluster_low_demand_df <- as.data.frame.table(
  table(full_data_customer$CLUSTER, full_data_customer$LOW_DEMAND_CUST))
names(cluster_low_demand_df) <- c("Segment", "LowDemand", "Count")

# Filter out NA values for cleaner visualization
cluster_low_demand_df <- cluster_low_demand_df %>%
  filter(!is.na(Segment) & !is.na(LowDemand))

# Calculate proportions for LOW_DEMAND_CUST
cluster_low_demand_df$Pct <- cluster_low_demand_df$Count /
  ave(cluster_low_demand_df$Count, cluster_low_demand_df$Segment, FUN = sum)

# Plot the LOW_DEMAND_CUST distribution by cluster
ggplot(cluster_low_demand_df, aes(x = Segment, y = Pct, fill = LowDemand)) +
  geom_bar(stat = "identity", position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = low_demand_colors) +
  geom_text(aes(label = scales::percent(Pct, accuracy = 0.1)),
            position = position_fill(vjust = 0.5), color = "black", size = 3.5) +
  labs(title = "Clusters by Low Demand Customers",
       subtitle = "Proportional Representation by Low Demand Customers",
       x = "Cluster", y = "Percentage", fill = "Low Demand Customers") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12, color = "gray30"),
        legend.position = "right",
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())
Code
# Create a log-transformed TOTAL_ORDERED variable (log1p handles zero values)
full_data_customer$LOG_TOTAL_ORDERED <- log1p(full_data_customer$TOTAL_ORDERED)

# Reshape data using tidyr::pivot_longer()
plot_data <- full_data_customer %>%
  pivot_longer(cols = c(RFM_SCORE, NUM_ORDERS, LOG_TOTAL_ORDERED),
               names_to = "variable", values_to = "value")

# Rename variable levels for better readability
plot_data$variable <- case_when(
  plot_data$variable == "LOG_TOTAL_ORDERED" ~ "TOTAL_ORDERED (Log Scale)",
  TRUE ~ plot_data$variable
)

# Create a boxplot with facet_wrap, using the cluster palette defined earlier
ggplot(plot_data, aes(x = factor(CLUSTER), y = value, fill = factor(CLUSTER))) +
  geom_boxplot(color = "black", alpha = 0.7) +  # Black borders for contrast
  scale_fill_manual(values = palette_clusters) +
  facet_wrap(~ variable, scales = "free_y") +  # Different y-scale per variable
  labs(title = "Customer Segmentation Characterization",
       subtitle = "Distribution of RFM Score, Number of Orders, and Log-Transformed Total Ordered for each cluster",
       x = "Cluster", y = "Value") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 11, color = "gray30"),
        axis.title = element_text(face = "bold"),
        panel.grid.major = element_line(color = "gray90"),
        panel.grid.minor = element_blank(),
        legend.position = "none")  # Clusters are already labeled on the x-axis
Code
# Prepare the dataset: long format with log-transformed order counts
plot_data_filtered <- full_data_customer %>%
  pivot_longer(cols = c("OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP",
                        "OT_MYCOKE.LEGACY", "OT_MYCOKE360", "OT_EDI"),
               names_to = "variable", values_to = "value") %>%
  mutate(value = log1p(value))  # log1p safely handles zero values

# Define custom labels for the variables
custom_labels <- c("OT_CALL.CENTER" = "Call Center", "OT_OTHER" = "Other",
                   "OT_SALES.REP" = "Sales Rep", "OT_MYCOKE.LEGACY" = "MyCoke Legacy",
                   "OT_MYCOKE360" = "MyCoke360", "OT_EDI" = "EDI")

# Generate the boxplot
ggplot(plot_data_filtered, aes(x = factor(CLUSTER), y = value, fill = factor(CLUSTER))) +
  geom_boxplot(color = "black", alpha = 0.7) +  # Black borders for contrast
  scale_fill_manual(values = palette_clusters) +
  facet_wrap(~ variable, scales = "fixed",
             labeller = labeller(variable = as_labeller(custom_labels))) +
  scale_y_continuous(limits = c(0, 6), breaks = seq(0, 6, 1)) +  # Fixed y-axis scale
  labs(title = "Customer Segmentation Characterization",
       subtitle = "Distribution of orders by Order Type (Log Scale) for each cluster",
       x = "Cluster", y = "Log(Value)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 11, color = "gray30"),
        axis.title = element_text(face = "bold"),
        panel.grid.major = element_line(color = "gray90"),
        panel.grid.minor = element_blank(),
        legend.position = "none")  # Clusters are already labeled on the x-axis
Cluster 1 (Red): High Demand Customers
- Composition: Approximately 80% of customers receive deliveries via red trucks (based on the benchmark threshold of 400 gallons on average per year).
- Growth: Around 7% of customers exhibit high growth potential.
- Local Fountain Only: Only 1.5% of customers are local fountain-only.
- Average RFM: The average RFM score for this cluster is 29, the highest among the three clusters.
- Average Number of Orders: The average number of orders per customer was 81 in 2023 and 2024, with many outliers showing significantly higher order volumes.
- Total Ordered Volume: The average total ordered volume per customer in 2023 and 2024 is 4,638 gallons. This cluster has the highest number of outliers with elevated volumes, which skews the average upward; the median volume is 1,707 gallons.
- Volume Share: This cluster represents 76% of the total volume consumed in 2023 and 2024.
- It has the highest number of orders through digital channels and is the cluster most served by sales representatives.
Cluster 2 (Blue): Intermediate Customers with Growth Potential
- Composition: Approximately 87% of customers receive deliveries via white trucks (based on the benchmark threshold of 400 gallons on average per year).
- Growth: This cluster has the highest percentage of customers with high growth potential, at 16.6%.
- Local Fountain Only: Around 4.2% of customers are local fountain-only.
- Average RFM: The average RFM score for this group is 18.7, the second highest among the clusters.
- Average Number of Orders: The average number of orders per customer was 30 in 2023 and 2024.
- Total Ordered Volume: The average total ordered volume per customer in 2023 and 2024 is 525 gallons. The median volume is 331 gallons.
- Volume Share: This cluster represents approximately 22% of the total volume consumed in 2023 and 2024.
- It is the cluster with the highest average number of orders placed via the call center. It has fewer orders through digital channels than Cluster 1 but more than Cluster 3, and the number of orders through sales representatives is similar to Cluster 1.
Cluster 3 (Yellow): Less Active Customers with Low Order Volume
- Composition: Only 0.4% of customers receive deliveries via red trucks (based on the benchmark threshold of 400 gallons on average per year).
- Growth: Approximately 6% of customers exhibit high growth potential.
- Local Fountain Only: This cluster has the highest percentage of local fountain-only customers, at 7.5%.
- Average RFM: The average RFM score is 7, indicating these are the least active customers.
- Average Number of Orders: The average number of orders per customer was 5.5 in 2023 and 2024.
- Total Ordered Volume: The average total ordered volume per customer in 2023 and 2024 is around 80 gallons, while the median is 57 gallons, indicating a large number of customers with smaller volumes.
- Volume Share: This cluster represents only 1.7% of the total volume consumed in 2023 and 2024.
- The cluster shows orders concentrated through call centers, digital channels, and sales representatives, although in smaller absolute quantities compared to the other clusters.
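The volume-share figures quoted for the three clusters can be reproduced directly from the customer table. A minimal sketch, assuming `TOTAL_ORDERED` holds each customer's combined 2023-2024 gallons:

```r
# Total gallons per cluster, then each cluster's share of the overall total
volume_by_cluster <- tapply(full_data_customer$TOTAL_ORDERED,
                            full_data_customer$CLUSTER, sum, na.rm = TRUE)
round(volume_by_cluster / sum(volume_by_cluster), 3)
```

Per the text above, this should come out to roughly 76%, 22%, and 1.7% for Clusters 1, 2, and 3 respectively.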
8. Classification Models for Explaining Clusters
To better understand the variables influencing cluster composition and facilitate future predictions without the need for re-clustering, two classification models will be used: decision trees and multinomial logistic regression. These models will help identify the key characteristics that drive cluster formation.
By applying these models to new data, cluster assignments can be predicted, streamlining the analysis process and eliminating the need to recreate the clusters whenever new data is introduced.
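The workflow described above can be sketched in a few lines. This is illustrative only: it assumes the trained model object (here `decision_tree_model`, fitted later in this section) and the `selected_vars` vector are in scope, and uses a few existing rows of `full_data_customer` as stand-ins for new customers:

```r
# Hypothetical "new" customers: same columns as the training data
new_customers <- full_data_customer[1:5, selected_vars]

# Assign each new customer to a cluster without re-running the clustering,
# provided the same preprocessing used at training time is applied first
predicted_cluster <- predict(decision_tree_model, newdata = new_customers)
predicted_cluster
```
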
8.1 Decision Tree
The selected variables will be analyzed to explain the clusters using a decision tree, with the dataset split into training and test sets, applying 20-fold cross-validation.
Code
# Prepare data for the decision tree
model_data <- full_data_customer %>%
  dplyr::select(all_of(selected_vars), CLUSTER) %>%
  mutate(CLUSTER = as.factor(CLUSTER))

# Create train/test split (70% train, 30% test)
set.seed(500)
train_indices <- createDataPartition(model_data$CLUSTER, p = 0.7, list = FALSE)
train_data <- model_data[train_indices, ]
test_data  <- model_data[-train_indices, ]

# Set up cross-validation (20-fold)
train_control <- trainControl(method = "cv", number = 20)

# Train the decision tree model with cross-validation
decision_tree_model <- train(CLUSTER ~ ., data = train_data,
                             method = "rpart",
                             trControl = train_control,
                             tuneLength = 5)

# Plot the decision tree
rpart.plot(decision_tree_model$finalModel,
           extra = 101, box.palette = "Blues", shadow.col = "gray",
           nn = TRUE, main = "Decision Tree: Explaining Customer Clusters",
           branch.col = "gray", faclen = 0, tweak = 1.1)
Below are the prediction performance metrics:
Code
# Evaluate model performance on the test set
dt_test_predictions <- predict(decision_tree_model, test_data, type = "raw")
dt_test_confusion_matrix <- confusionMatrix(dt_test_predictions, test_data$CLUSTER)

# Calculate accuracy on the test set
dt_test_accuracy <- round(mean(dt_test_predictions == test_data$CLUSTER), 2)

# Evaluate model performance on the train set
dt_train_predictions <- predict(decision_tree_model, train_data, type = "raw")
dt_train_accuracy <- round(mean(dt_train_predictions == train_data$CLUSTER), 2)

# Print model performance metrics
cat("\n--- Decision Tree Model Performance ---\n")
--- Decision Tree Model Performance ---
Code
print(dt_test_confusion_matrix)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1568 170 0
2 332 4336 72
3 1 270 2346
Overall Statistics
Accuracy : 0.9071
95% CI : (0.9009, 0.913)
No Information Rate : 0.5251
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8477
Mcnemar's Test P-Value : < 2.2e-16
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.8248 0.9079 0.9702
Specificity 0.9764 0.9065 0.9594
Pos Pred Value 0.9022 0.9148 0.8964
Neg Pred Value 0.9547 0.8990 0.9889
Prevalence 0.2090 0.5251 0.2659
Detection Rate 0.1724 0.4767 0.2579
Detection Prevalence 0.1911 0.5212 0.2877
Balanced Accuracy 0.9006 0.9072 0.9648
Code
# Evaluate accuracy on train and test sets for the decision tree
dt_train_acc <- round(mean(dt_train_predictions == train_data$CLUSTER), 2)
dt_test_acc  <- round(mean(dt_test_predictions == test_data$CLUSTER), 2)

# Create comparison data frame
dt_acc_comp <- data.frame(Set = c("Train", "Test"),
                          Accuracy = c(dt_train_acc, dt_test_acc))

# Display the formatted table with kable
dt_acc_comp %>%
  kable(caption = "Decision Tree Accuracy Comparison (Train vs Test)",
        col.names = c("Dataset", "Accuracy")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Decision Tree Accuracy Comparison (Train vs Test)

| Dataset | Accuracy |
|---------|----------|
| Train   | 0.91     |
| Test    | 0.91     |
The model has an accuracy of 91% on both the train and test sets, demonstrating strong performance across all clusters. In Cluster 1: High Demand Customers, precision is 90.2% and recall is 82.5%. For Cluster 2: Intermediate Customers with Growth Potential, precision is 91.5% and recall is 90.8%. For Cluster 3: Less Active Customers with Low Order Volume, precision is 89.6% and recall is 97.0%.
Overall, the model generalizes well: train and test accuracy are identical at 91%, and precision and recall are strong across all three clusters, with the weakest point being recall for Cluster 1.
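The per-class figures quoted above follow directly from the confusion matrix: precision corresponds to "Pos Pred Value" and recall to "Sensitivity". A minimal sketch using the test-set counts printed earlier:

```r
# Test-set confusion matrix from the decision tree output above
cm <- matrix(c(1568,  170,    0,
                332, 4336,   72,
                  1,  270, 2346),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = 1:3, Reference = 1:3))

precision <- diag(cm) / rowSums(cm)  # correct predictions / all predictions per class
recall    <- diag(cm) / colSums(cm)  # correct predictions / all actual members per class
round(rbind(precision, recall), 3)
# Matches the values quoted above, e.g. 90.2% precision and 82.5% recall for Cluster 1
```
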
Code
# Extract variable importance from the trained decision tree model
var_importance <- decision_tree_model$finalModel$variable.importance
dt_var_importance_df <- data.frame(Variable = names(var_importance),
                                   Importance = var_importance)

# Normalize importance values, sort, and keep the top 10 variables
dt_var_importance_df <- dt_var_importance_df %>%
  mutate(Importance = Importance / max(Importance)) %>%
  arrange(desc(Importance)) %>%
  head(10)

# Plot the top 10 most important variables
ggplot(dt_var_importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "seagreen") +
  coord_flip() +
  labs(title = "Top 10 Variables Explaining Customer Clusters",
       subtitle = "Decision Tree Variable Importance",
       x = NULL, y = "Relative Importance (Normalized)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12, color = "gray30"),
        axis.title = element_text(face = "bold"),
        panel.grid.major.y = element_blank())
The most important variables in the model were the number of orders per customer, average days between orders, RFM score, total ordered volume, total cost, and the low demand customers flag.
8.2 Multinomial Logistic Regression
The influence of the selected variables on customer clusters will be explored using multinomial logistic regression, which models the probability that a customer belongs to each of the three established clusters based on their characteristics, allowing cluster membership to be predicted for new customers.
Variable standardization and Elastic Net regularization will be used in the model development process.
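For reference, the Elastic Net penalty mixes the ridge (L2) and lasso (L1) terms; with `alpha = 0.5`, as used below, both are weighted equally. The helper function and example coefficients here are illustrative, not part of the analysis:

```r
# Elastic Net penalty as defined by glmnet:
#   lambda * ( (1 - alpha)/2 * sum(beta^2) + alpha * sum(|beta|) )
elastic_net_penalty <- function(beta, lambda, alpha = 0.5) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

# Example: penalty for a hypothetical coefficient vector
elastic_net_penalty(c(0.5, -0.2, 0), lambda = 0.1, alpha = 0.5)
```

Because the penalty depends on coefficient magnitudes, predictors must be standardized first, which is why centering and scaling precede model training below.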
Code
# Normalize predictors
preprocess_params <- preProcess(model_data, method = c("center", "scale"))
model_data <- predict(preprocess_params, model_data)

# Create train/test split (70% train, 30% test)
set.seed(500)
train_indices <- createDataPartition(model_data$CLUSTER, p = 0.7, list = FALSE)
train_data <- model_data[train_indices, ]
test_data  <- model_data[-train_indices, ]

# Set up cross-validation (10-fold)
train_control <- trainControl(method = "cv", number = 10)

# Define a smaller tuning grid for efficiency
tune_grid <- expand.grid(alpha = 0.5, lambda = seq(0.1, 1, length = 5))

# Train the model with Elastic Net regularization
mlogistic_model <- train(CLUSTER ~ ., data = train_data,
                         method = "glmnet",
                         trControl = train_control,
                         tuneGrid = tune_grid,
                         maxit = 200000)  # Passed through to glmnet

# Print trained model summary
print(mlogistic_model)
glmnet
21225 samples
21 predictor
3 classes: '1', '2', '3'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 19102, 19103, 19102, 19103, 19103, 19102, ...
Resampling results across tuning parameters:
lambda Accuracy Kappa
0.100 0.8948407 0.8217432
0.325 0.7363958 0.5030291
0.550 0.5250412 0.0000000
0.775 0.5250412 0.0000000
1.000 0.5250412 0.0000000
Tuning parameter 'alpha' was held constant at a value of 0.5
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were alpha = 0.5 and lambda = 0.1.
Code
# Make predictions on the test set
mlogis_predictions <- predict(mlogistic_model, test_data)

# Evaluate model performance with a confusion matrix
mlogistic_confusion_matrix <- confusionMatrix(mlogis_predictions, test_data$CLUSTER)
print(mlogistic_confusion_matrix)
Confusion Matrix and Statistics
Reference
Prediction 1 2 3
1 1346 2 0
2 555 4575 216
3 0 199 2202
Overall Statistics
Accuracy : 0.8931
95% CI : (0.8866, 0.8994)
No Information Rate : 0.5251
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8189
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: 1 Class: 2 Class: 3
Sensitivity 0.7080 0.9579 0.9107
Specificity 0.9997 0.8215 0.9702
Pos Pred Value 0.9985 0.8558 0.9171
Neg Pred Value 0.9284 0.9464 0.9677
Prevalence 0.2090 0.5251 0.2659
Detection Rate 0.1480 0.5030 0.2421
Detection Prevalence 0.1482 0.5878 0.2640
Balanced Accuracy 0.8539 0.8897 0.9404
Code
# Generate predictions on the train set
train_predictions <- predict(mlogistic_model, train_data)

# Evaluate accuracy on train and test sets
mlogistic_train_acc <- round(postResample(pred = train_predictions,
                                          obs = train_data$CLUSTER)["Accuracy"], 2)
mlogistic_test_acc  <- round(postResample(pred = mlogis_predictions,
                                          obs = test_data$CLUSTER)["Accuracy"], 2)

# Create comparison data frame
mlogistic_acc_comp <- data.frame(Set = c("Train", "Test"),
                                 Accuracy = c(mlogistic_train_acc, mlogistic_test_acc))

# Display the formatted table
mlogistic_acc_comp %>%
  kable(caption = "Multinomial Logistic Regression Accuracy Comparison (Train vs Test)",
        col.names = c("Dataset", "Accuracy")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Multinomial Logistic Regression Accuracy Comparison (Train vs Test)

| Dataset | Accuracy |
|---------|----------|
| Train   | 0.89     |
| Test    | 0.89     |
The model achieved an accuracy of 89.3% on the test set, reflecting strong performance. In Cluster 1 (Red): High Demand Customers, recall is 70.8% and precision is 99.8%. For Cluster 2 (Blue): Intermediate Customers with Growth Potential, recall is 95.8% and precision is 85.6%. Finally, Cluster 3 (Yellow): Less Active Customers with Low Order Volume shows recall of 91.1% and precision of 91.7%. Overall, the model performs well, with Cluster 2 showing the highest recall and Cluster 1 the strongest precision.
The relatively low recall in Cluster 1 (Red) (70.8%) suggests that the model may not always correctly identify customers in this group, leading to false negatives.
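One way to investigate those false negatives is to profile the Cluster 1 customers that the model assigns elsewhere. A sketch assuming `test_data` and `mlogis_predictions` from the chunks above are in scope (column names follow the variables used throughout this report):

```r
# Cluster 1 customers the model misclassified (false negatives)
fn_mask <- test_data$CLUSTER == "1" & mlogis_predictions != "1"

# Compare their key characteristics against correctly classified Cluster 1 customers
summary(test_data[fn_mask, c("NUM_ORDERS", "TOTAL_ORDERED", "RFM_SCORE")])
table(Predicted_as = mlogis_predictions[fn_mask])
```

If most false negatives land in Cluster 2, as the confusion matrix above suggests, they are likely borderline customers whose order behavior sits between the two profiles.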
Code
# Extract variable importance from the multinomial model
variable_importance <- varImp(mlogistic_model, scale = TRUE)

# Define custom colors for the clusters
palette_clusters <- c("1" = "#FF6347",  # Coral
                      "2" = "#4682B4",  # Cornflower blue
                      "3" = "#FFD700")  # Yellow

# Extract importance data
var_imp_df <- as.data.frame(variable_importance$importance)
var_imp_df$Variable <- rownames(var_imp_df)

# Convert to long format
var_imp_long <- melt(var_imp_df, id.vars = "Variable",
                     variable.name = "Cluster", value.name = "Importance")

# Clean up cluster names (remove 'Overall' if present)
var_imp_long$Cluster <- gsub("Overall", "", var_imp_long$Cluster)

# Keep only the top 10 variables per cluster for better visualization
top_vars <- var_imp_long %>%
  group_by(Cluster) %>%
  top_n(10, Importance) %>%
  ungroup()

# Create the visualization with custom cluster colors and no color legend
ggplot(top_vars, aes(x = reorder(Variable, Importance), y = Importance, fill = Cluster)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  facet_wrap(~ Cluster, scales = "free_x") +
  scale_fill_manual(values = palette_clusters) +
  scale_y_continuous(limits = c(0, 100)) +  # Same 0-100 importance scale for all facets
  labs(title = "Multinomial Logistic Regression",
       subtitle = "Variable Importance by Cluster",
       x = "Variables", y = "Importance") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12, color = "gray30"),
        strip.text = element_text(size = 12, face = "bold"),
        axis.title = element_text(face = "bold"),
        axis.text.y = element_text(size = 9),
        panel.grid.major.y = element_blank()) +
  guides(fill = "none")  # Remove the color legend
The model indicates that:
For Cluster 1, the key variables included the number of orders, RFM score, order type (MyCoke Legacy), order type (MyCoke 360), order type (using sales representatives), and chain member.
For Cluster 2, the most significant variables were the average number of days between orders, low demand customers, order type (call center), order type (MyCoke Legacy), and channel growth potential.
For Cluster 3, the most important variables were the average number of days between orders, low demand customers, RFM score, and days since the first delivery.
The models built to predict clusters for new customers performed well and provide clear insight into the characteristics that shape each cluster. We can therefore proceed with the final analysis for fleet assignment.
9. Data-Driven Fleet Assignment
Based on all the previous analyses, it is concluded that the fleet type designated for clients should be defined by considering different criteria, not just the average annual volume demand.
The main criteria shaping this approach include the similarities among clients represented by the clusters, the analysis of volume distributions by cold drink channel segment, and the growth potential of the clients.
Before proceeding, the relationship between each cluster and the 400-gallon annual threshold will be analyzed.
Code
# Define custom colors for the clusters
palette_clusters <- c("1" = "#FF6347",  # Coral
                      "2" = "#4682B4",  # Cornflower blue
                      "3" = "#FFD700")  # Yellow

# Create a data frame for the threshold line
threshold_line <- data.frame(x = c(1, 500), y = c(400, 400),
                             type = "400 Gallons Threshold")

# Scatter plot with log scales, separated by CLUSTER using facet_wrap
ggplot() +
  geom_jitter(data = full_data_customer,
              aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = as.factor(CLUSTER)),
              alpha = 0.5, width = 0.2) +  # Jitter to avoid overplotting
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +  # Log scale for x-axis (1 -> 500)
  scale_y_log10(limits = c(10, 1000000),
                breaks = c(10, 100, 1000, 10000, 100000, 1000000),
                labels = scales::comma) +  # Format numbers with commas
  scale_color_manual(values = palette_clusters, name = "Cluster") +
  scale_linetype_manual(values = "solid", name = "") +  # Add the threshold line to the legend
  labs(title = "Average Annual Consumption vs. Number of Orders by Cluster",
       x = "Number of Orders (Log Scale)",
       y = "Average Annual Consumption (Log Scale)") +
  facet_wrap(~ CLUSTER, scales = "fixed") +  # Same scale across all facets
  theme_minimal() +
  theme(text = element_text(size = 12),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        strip.text = element_text(size = 9),
        panel.grid.major = element_line(color = "gray90", linetype = "solid", size = 0.3),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white", color = "white"),
        legend.position = "right")
Regarding the 400-gallon benchmark for defining clients to be served by red trucks, the following can be noted:
Cluster 1: This cluster mainly contains clients with higher demand volumes or a larger number of orders. A smaller portion of these clients falls below the 400-gallon threshold, though some remain close to a minimum of 100 gallons.
Cluster 2: This cluster has a large number of clients both above and below the threshold, so it requires further refinement.
Cluster 3: The vast majority of clients fall below the threshold. However, the few clients above it tend to place a small number of orders per year.
9.1 Cluster 2 Analysis for Fleet Assignment
Cluster 2 comprises just over half of all clients, making it difficult to define clear criteria for fleet designation.
The multinomial regression model indicated that the variable “Average Days Between Orders” (AVG_DAYS_BET_ORD) was the most important, while in the decision tree model, it was the second most important variable.
Therefore, below is the plot showing the relationship between the average annual consumption of each client and their average days between orders.
Code
# Define custom colors for the clusters
palette_clusters <- c(
  "1" = "#FF6347", # Coral
  "2" = "#4682B4", # Cornflower blue
  "3" = "#FFD700"  # Yellow
)

# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Create scatter plot with log scale for x-axis, separated by CLUSTER using facet_wrap
ggplot() +
  geom_jitter(data = full_data_customer,
              aes(x = AVG_DAYS_BET_ORD, y = AVG_ANNUAL_CONSUMP, color = as.factor(CLUSTER)),
              alpha = 0.5, width = 0.2) + # Jitter to avoid overplotting
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(
    limits = c(1, 1000),       # Set limits for the x-axis
    breaks = c(10, 100, 1000), # Custom breaks for the X-axis
    labels = scales::comma     # Format numbers with commas
  ) +
  scale_y_log10(
    limits = c(10, 1000000), # Set limits for the y-axis from 10 to 1,000,000
    breaks = c(10, 100, 1000, 10000, 100000, 1000000), # Custom breaks for the Y-axis
    labels = scales::comma # Format numbers with commas
  ) +
  scale_color_manual(values = palette_clusters, name = "Cluster") +
  scale_linetype_manual(values = "solid", name = "") + # Add the threshold line to legend
  labs(
    title = "Avg. Annual Consumption vs. Avg. Days Between Orders by Cluster",
    x = "Average Days Between Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ CLUSTER, scales = "fixed") + # Ensure same scale across all facets
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 9), # Adjust facet labels' size
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3), # Major Y grid lines for the specific breaks
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3), # Light gray vertical grid lines as background
    panel.grid.minor = element_blank(), # Remove minor grid lines
    panel.background = element_rect(fill = "white", color = "white"), # Ensure clean background
    legend.position = "right" # Move legend to right side
  )
Code
# Filter the data for Cluster 2
cluster_2_data <- full_data_customer %>%
  filter(CLUSTER == 2)

# Calculate deciles for AVG_DAYS_BET_ORD
deciles <- quantile(cluster_2_data$AVG_DAYS_BET_ORD, probs = seq(0, 1, 0.1))

# Create a simple data frame with the decile values, transposing it for horizontal display
decile_table <- data.frame(
  Decile = paste0(seq(0, 90, 10), "%"),
  Lower_Bound = deciles[-length(deciles)], # All but the last quantile value
  Upper_Bound = deciles[-1]                # All but the first quantile value
)

# Print the decile table horizontally
decile_table_t <- t(decile_table[, -1])
colnames(decile_table_t) <- decile_table$Decile
decile_table_t
# Get a summary of AVG_DAYS_BET_ORD for Cluster 2
summary_cluster_2 <- summary(cluster_2_data$AVG_DAYS_BET_ORD)

# Round the summary
summary_cluster_2_rounded <- round(summary_cluster_2, 0)

# Display the summary in a simple format
summary_table <- data.frame(
  Statistic = names(summary_cluster_2_rounded),
  Value = as.vector(summary_cluster_2_rounded)
)

# Print the summary table
summary_table
Statistic Value
1 Min. 3
2 1st Qu. 16
3 Median 24
4 Mean 56
5 3rd Qu. 37
6 Max. 731
Filtering the average days between orders for Cluster 2 shows that 60 percent of customers average 33 days or fewer between orders. The group's mean is 56.4 days, with a median of 24 days.
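The 60 percent figure can be read off the decile table above; as a minimal sketch (using an illustrative toy vector in place of the real cluster_2_data$AVG_DAYS_BET_ORD, which is not reproduced here), the share of customers at or below any cutoff can be computed directly:

```r
# Illustrative stand-in for cluster_2_data$AVG_DAYS_BET_ORD (hypothetical values)
avg_days <- c(3, 5, 10, 16, 20, 24, 30, 33, 40, 731)

# Share of customers with an average of 33 days or fewer between orders
share_33 <- mean(avg_days <= 33)
share_33 # 0.8 for this toy vector
```

The same one-liner applied to the actual Cluster 2 data yields the reported share.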
Building on the previously derived flags, low-demand customers and high-growth-potential customers, additional business-relevant criteria will be introduced to better segment customers within Cluster 2.
These new criteria are an average annual consumption greater than 1,349 gallons and an average of 33 or fewer days between orders. The first threshold was chosen because it represents the point at which delivery costs are minimized. The second was selected because of its strong influence on the clustering model, and because customers with high growth potential (excluding low-demand customers) and an average of 33 or fewer days between orders, who represent nearly two-thirds of customers, are believed to be capable of ordering more frequently, further shortening the order interval.
As a result, in the plot below, Cluster 2 customers who are not low demand, show high growth potential, and average 33 or fewer days between orders, or whose average annual consumption exceeds 1,349 gallons, are classified as Emerging Opportunities and assigned to the red truck category.
Code
# Define custom colors
palette_clusters <- c(
  "Emerging Opportunities - RED TRUCK" = "#B33951", # Emerging Opportunities
  "General Clients - WHITE TRUCK" = "#D3D3D3"       # General Clients
)

# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Filter only Cluster 2 and create category for faceting
# (parentheses added for clarity; & binds tighter than |, so behavior is unchanged)
filtered_data <- full_data_customer %>%
  filter(CLUSTER == 2) %>%
  mutate(
    Category = ifelse(
      (LOW_DEMAND_CUST == 0 & HIGH_GROW_POT == 1 & AVG_DAYS_BET_ORD <= 33) | AVG_ANNUAL_CONSUMP > 1349,
      "Emerging Opportunities - RED TRUCK",
      "General Clients - WHITE TRUCK"
    )
  )

# Create scatter plot with facet_wrap
ggplot(filtered_data) +
  geom_jitter(aes(x = AVG_DAYS_BET_ORD, y = AVG_ANNUAL_CONSUMP, color = Category),
              width = 0.2, alpha = 0.5) + # Jitter to avoid overplotting
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 1000)) + # Log scale for x-axis (1 -> 1000)
  scale_y_log10(
    limits = c(10, 1000000), # Set limits for the y-axis from 10 to 1,000,000
    breaks = c(10, 100, 1000, 10000, 100000, 1000000), # Custom breaks for the Y-axis
    labels = scales::comma # Format numbers with commas
  ) +
  scale_color_manual(values = palette_clusters, name = "Fleet Assignment") +
  scale_linetype_manual(values = "solid", name = "") + # Add the threshold line to legend
  labs(
    title = "Cluster 2 - Avg. Annual Consumption vs. Avg. Days Between Orders",
    x = "Average Days Between Orders (Log Scale)",
    y = "Avg Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ Category, scales = "fixed") + # Separate categories side by side
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(), # Remove minor grid lines
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "right", # Move legend to right side
    strip.text = element_blank() # Remove facet titles
  )
Below, the impact of fleet assignment on each cold drink channel is explored using the previous criteria for Cluster 2, along with its relation to average annual consumption and the number of orders.
Code
# Define color palette
palette_clusters <- c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")

# Filter "RED TRUCK" data from full_data_customer
cluster_2_red_truck_data <- subset(full_data_customer,
                                   !is.na(CLUSTER_2_FLEET) & CLUSTER_2_FLEET == "RED TRUCK")

# Create a data frame for the threshold line (fixed at 400 gallons)
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Create the plot
ggplot() +
  geom_jitter(data = cluster_2_red_truck_data,
              aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = as.factor(CLUSTER_2_FLEET)),
              alpha = 0.5, width = 0.2) +
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +
  scale_y_log10(
    limits = c(10, 100000),
    breaks = c(10, 100, 1000, 10000, 100000),
    labels = scales::comma
  ) +
  scale_color_manual(values = palette_clusters, name = "Fleet Assignment") +
  scale_linetype_manual(values = "solid", name = NULL) +
  labs(
    title = "Average Annual Consumption vs. Number of Orders for Cluster 2",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ COLD_DRINK_CHANNEL, scales = "fixed") +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 10),
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "right"
  )
Exploring growth opportunities in this way, almost every channel would have a considerable number of clients with annual volumes below 400 gallons served by red trucks.
Code
# Define color palette
palette_clusters <- c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")

# Filter "WHITE TRUCK" data from full_data_customer
cluster_2_white_truck_data <- subset(full_data_customer,
                                     !is.na(CLUSTER_2_FLEET) & CLUSTER_2_FLEET == "WHITE TRUCK")

# Create a data frame for the threshold line (fixed at 400 gallons)
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Create the plot
ggplot() +
  geom_jitter(data = cluster_2_white_truck_data,
              aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = as.factor(CLUSTER_2_FLEET)),
              alpha = 0.5, width = 0.2) +
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +
  scale_y_log10(
    limits = c(10, 100000),
    breaks = c(10, 100, 1000, 10000, 100000),
    labels = scales::comma
  ) +
  scale_color_manual(values = palette_clusters, name = "Fleet Assignment") +
  scale_linetype_manual(values = "solid", name = NULL) +
  labs(
    title = "Average Annual Consumption vs. Number of Orders for Cluster 2",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ COLD_DRINK_CHANNEL, scales = "fixed") +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 10),
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "right"
  )
On the other hand, the criteria naturally assign white trucks to a large number of clients with an average annual volume of less than 400 gallons in each segment, while still ensuring that high-volume clients are served by red trucks.
The previous graphs also highlight an opportunity for the company to develop targeted strategies for each segment.
9.2 Fleet Assignment Criteria
Based on the analysis, the recommended fleet assignment will be determined by the following criteria:
Customers with an average annual consumption greater than 1,349 gallons will be assigned to RED TRUCKS.
Low-demand customers (identified by LOW_DEMAND_CUST == 1) will be assigned to WHITE TRUCKS.
All remaining customers in Cluster 1 will be assigned to RED TRUCKS, after applying the previous rules.
All remaining customers in Cluster 3 will be assigned to WHITE TRUCKS, after applying the previous rules.
Customers in Cluster 2 will be assigned to RED TRUCKS if they meet all of the following conditions:
They are not low-demand customers (LOW_DEMAND_CUST == 0).
They are classified as high growth potential (HIGH_GROW_POT == 1).
Their average days between orders is less than or equal to 33 (AVG_DAYS_BET_ORD <= 33).
The remaining customers in Cluster 2 will be assigned to WHITE TRUCKS.
Any customers who do not meet any of these criteria will remain unclassified (NA).
Code
# Assign customers to RED TRUCK or WHITE TRUCK based on specified criteria
full_data_customer <- full_data_customer %>%
  mutate(
    NEW_FLEET = case_when(
      AVG_ANNUAL_CONSUMP > 1349 ~ "RED TRUCK", # Customers with high annual consumption
      LOW_DEMAND_CUST == 1 ~ "WHITE TRUCK",    # Low-demand customers
      CLUSTER == 1 ~ "RED TRUCK",              # Cluster 1 customers
      CLUSTER == 3 ~ "WHITE TRUCK",            # Cluster 3 customers
      CLUSTER == 2 & (LOW_DEMAND_CUST == 0 & HIGH_GROW_POT == 1 & AVG_DAYS_BET_ORD <= 33) ~ "RED TRUCK", # Cluster 2 customers with all conditions met
      CLUSTER == 2 ~ "WHITE TRUCK",            # Remaining Cluster 2 customers
      TRUE ~ NA_character_                     # Others remain NA
    )
  )
Below are the representations of the clusters and the designated fleet.
Code
# Define custom colors for the fleet and clusters
palette_fleet <- c(
  "RED TRUCK" = "#B33951",  # Red truck
  "WHITE TRUCK" = "#D3D3D3" # White truck
)
palette_clusters <- c(
  "Cluster 1" = "#FF6347", # Coral
  "Cluster 2" = "#4682B4", # Cornflower blue
  "Cluster 3" = "#FFD700"  # Yellow
)

# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Filter the data for RED TRUCK and WHITE TRUCK
red_truck_data <- full_data_customer %>% filter(NEW_FLEET == "RED TRUCK")
white_truck_data <- full_data_customer %>% filter(NEW_FLEET == "WHITE TRUCK")

# Combine both datasets to differentiate them in facet_wrap
combined_data <- bind_rows(
  red_truck_data %>% mutate(Fleet_Type = "RED TRUCK"),
  white_truck_data %>% mutate(Fleet_Type = "WHITE TRUCK")
)

# Define a custom labeller for the clusters
custom_labeller <- labeller(CLUSTER = c(
  "1" = "Cluster 1",
  "2" = "Cluster 2",
  "3" = "Cluster 3"
))

# Create scatter plot with log scales and no background color for facet labels
ggplot(combined_data) +
  geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = Fleet_Type),
              alpha = 0.5, width = 0.2) + # Jitter to avoid overplotting
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) + # Log scale for x-axis with specific limits
  scale_y_log10(
    limits = c(10, 1000000),
    breaks = c(10, 100, 1000, 10000, 100000, 1000000),
    labels = scales::comma
  ) +
  scale_color_manual(values = palette_fleet) +
  scale_linetype_manual(values = "solid", name = "") +
  labs(
    title = "Fleet Assignment by Cluster",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ CLUSTER + Fleet_Type, scales = "fixed", nrow = 1, labeller = custom_labeller) +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 9), # Adjust text size for facet labels
    strip.background = element_blank(), # Remove background color from facet labels
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "bottom",
    legend.box = "vertical"
  ) +
  guides(color = "none") # Remove legend for Fleet_Type
Code
# Summary
# summary(as.factor(full_data_customer$NEW_FLEET))
# summary(as.factor(full_data_customer$FLEET_TYPE))
#  RED TRUCK WHITE TRUCK
#       7239       23081
The new criteria established labels for all customers. A total of 7,926 customers were assigned to “Red Truck”, while 22,394 customers were assigned to “White Truck”.
The annual average consumption criterion of 400 gallons would have assigned 7,239 customers to be served by “Red Truck” and 23,081 customers to be served by “White Trucks”.
Therefore, 687 clients who were previously served by white trucks and who present higher growth potential will now be served by red trucks.
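As a quick cross-check of these counts, the net shift of 687 clients is simply the difference between the two red truck totals (a minimal arithmetic sketch, using the figures reported above):

```r
# Red truck counts under the recommendation vs. the 400-gallon threshold
red_recommended <- 7926
red_threshold   <- 7239

# Net number of clients moved from white trucks to red trucks
net_shift <- red_recommended - red_threshold
net_shift # 687
```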
Code
# Create a combined summary for both fleet types with percentages calculated separately by fleet_designation
summary_fleet_comparison_percent <- full_data_customer %>%
  # Create a longer dataset with both fleet designations
  pivot_longer(
    cols = c(NEW_FLEET, FLEET_TYPE),
    names_to = "fleet_designation",
    values_to = "fleet_value"
  ) %>%
  # Convert to factors
  mutate(
    LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
    fleet_designation = factor(fleet_designation,
                               levels = c("FLEET_TYPE", "NEW_FLEET"),
                               labels = c("Over 400 gallons threshold", "Recommended Fleet Type"))
  ) %>%
  # Group by fleet_designation and fleet_value, then calculate count
  group_by(fleet_designation, fleet_value) %>%
  summarise(count = n(), .groups = "drop") %>%
  # Calculate the percentage for each fleet_designation
  group_by(fleet_designation) %>%
  mutate(percentage = (count / sum(count)) * 100) %>%
  ungroup()

# Plot with facet_wrap and custom background showing percentage values
ggplot(summary_fleet_comparison_percent, aes(x = fleet_value, y = percentage, fill = fleet_value)) +
  # Add background based on facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) + # Light yellow background for "Over 400 gallons threshold"
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::comma(percentage, accuracy = 0.1, suffix = "%")),
            position = position_dodge(width = 0.8), vjust = -0.5, size = 3.5) +
  facet_wrap(~ fleet_designation, scales = "fixed") + # Fixed scale for both facets
  labs(title = "Comparison of Customer Distribution by Fleet Type Designation",
       x = "Fleet Type") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.title.y = element_text(size = 10),
    axis.text.x = element_text(size = 10),
    axis.title.x = element_blank(),
    legend.position = "none",
    panel.grid.major.x = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    strip.background = element_rect(fill = "white", color = NA),
    panel.spacing = unit(1, "lines")
  )
According to the criteria, 26% of customers would be served by red trucks and 74% by white trucks.
Code
# Creating a combined summary for both fleet types
summary_fleet_comparison_absolute <- full_data_customer %>%
  # Create a longer dataset with both fleet designations
  pivot_longer(
    cols = c(NEW_FLEET, FLEET_TYPE),
    names_to = "fleet_designation",
    values_to = "fleet_value"
  ) %>%
  # Convert to factors
  mutate(
    LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
    fleet_designation = factor(fleet_designation,
                               levels = c("FLEET_TYPE", "NEW_FLEET"),
                               labels = c("Over 400 gallons threshold", "Recommended Fleet Type"))
  ) %>%
  # Group and calculate total counts (absolute numbers)
  group_by(fleet_designation, LOCAL_FOUNT_ONLY, fleet_value) %>%
  summarise(count = n()) %>%
  ungroup()

# Plot with facet_wrap and custom background showing total customer counts
ggplot(summary_fleet_comparison_absolute, aes(x = LOCAL_FOUNT_ONLY, y = count, fill = fleet_value)) +
  # Add background based on facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) + # Light yellow background for "Over 400 gallons threshold"
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::comma(count)),
            position = position_dodge(width = 0.8), vjust = -0.5, size = 3.5) +
  facet_wrap(~ fleet_designation, scales = "fixed") + # Fixed scale for both facets
  labs(title = "Comparison of Number of Customers by Fleet Type Designation",
       x = "Customer Type") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.title.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, color = "black"),
    axis.title.x = element_blank(),
    legend.position = "bottom",
    legend.title = element_text(face = "bold"),
    panel.grid.major.x = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    strip.background = element_rect(fill = "white", color = NA),
    panel.spacing = unit(1, "lines")
  )
Considering only the “Others” group (customers who order from multiple sources), our recommendation would result in 729 additional stores being served by red trucks, compared to the 400-gallon threshold—an increase of 10.3%.
In contrast, within the “Local Fountain Only” group, the number of customers served by red trucks would decrease by 42, representing a 23.2% reduction.
Code
# Creating the summary data with both fleet designations
summary_volume_comparison <- full_data_customer %>%
  # Calculate total volume per customer
  mutate(total_volume = QTD_DLV_CA_2023 + QTD_DLV_GAL_2023 + QTD_DLV_CA_2024 + QTD_DLV_GAL_2024) %>%
  # Create a longer format dataset with both fleet designations
  pivot_longer(
    cols = c(NEW_FLEET, FLEET_TYPE),
    names_to = "fleet_designation",
    values_to = "fleet_value"
  ) %>%
  # Group by fleet designation, value, and LOCAL_FOUNT_ONLY
  group_by(fleet_designation, fleet_value, LOCAL_FOUNT_ONLY) %>%
  # Sum volumes within each group
  summarise(group_volume = sum(total_volume, na.rm = TRUE)) %>%
  ungroup() %>%
  # Calculate total volume for percentage
  group_by(fleet_designation) %>%
  mutate(
    total_designation_volume = sum(group_volume),
    percentage = group_volume / total_designation_volume * 100
  ) %>%
  ungroup() %>%
  # Convert to factors for proper ordering
  mutate(
    LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
    fleet_designation = factor(fleet_designation,
                               levels = c("FLEET_TYPE", "NEW_FLEET"),
                               labels = c("Over 400 gallons threshold", "Recommended Fleet Type"))
  )

# Create the faceted chart
ggplot(summary_volume_comparison, aes(x = LOCAL_FOUNT_ONLY, y = percentage, fill = fleet_value)) +
  # Add background based on facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) + # Light yellow background for "Over 400 gallons threshold"
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::comma(percentage, accuracy = 0.1, suffix = "%")),
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  facet_wrap(~ fleet_designation, scales = "fixed") + # Fixed scale for both facets
  labs(title = "Comparison of Volume Distribution by Fleet Type Designation",
       y = "Percentage of Total Volume") +
  # Set colors - assuming similar colors for both designations
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3"),
                    name = "Fleet Type") +
  # Set x-axis labels
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.title.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 0),
    axis.title.x = element_blank(),
    legend.position = "bottom",
    panel.grid.major.x = element_blank(),
    strip.text = element_text(size = 11, face = "bold")
  )
Although the number of customers served by red trucks has increased, the overall volume transported remains relatively stable.
Within the “Others” customer group, there would be a reduction of approximately 1,038,637 gallons over two years, representing a 3.4% decrease. This volume would now be delivered by white trucks.
For the “Local Fountain Only” group, the reduction in volume transported by red trucks is around 104,895 gallons over two years, a 31% decrease.
Despite the increase in the number of customers served by red trucks, which may lead to higher travel times and costs, the recommendation optimizes the delivery system by allowing red trucks to focus on strategic customers while reducing overall costs through higher-volume deliveries using white trucks.
A geographic distribution analysis of the customer base can be carried out at a later stage. One opportunity that emerges from this recommendation is to encourage customers within the same ZIP code to coordinate delivery dates. This would help consolidate volumes, streamline the delivery process, and further reduce operational costs.
Below is the average number of days between orders for each group.
Code
# Creating a combined summary for both fleet types with mean AVG_DAYS_BET_ORD
summary_fleet_comparison_absolute <- full_data_customer %>%
  # Create a longer dataset with both fleet designations
  pivot_longer(
    cols = c(NEW_FLEET, FLEET_TYPE),
    names_to = "fleet_designation",
    values_to = "fleet_value"
  ) %>%
  # Convert to factors
  mutate(
    LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
    fleet_designation = factor(fleet_designation,
                               levels = c("FLEET_TYPE", "NEW_FLEET"),
                               labels = c("Over 400 gallons threshold", "Recommended Fleet Type"))
  ) %>%
  # Group by relevant factors and calculate the mean of AVG_DAYS_BET_ORD (average days between orders)
  group_by(fleet_designation, LOCAL_FOUNT_ONLY, fleet_value) %>%
  summarise(mean_days_bet_ord = mean(AVG_DAYS_BET_ORD, na.rm = TRUE)) %>%
  ungroup()

# Plot with facet_wrap and custom background showing mean AVG_DAYS_BET_ORD
ggplot(summary_fleet_comparison_absolute, aes(x = LOCAL_FOUNT_ONLY, y = mean_days_bet_ord, fill = fleet_value)) +
  # Add background based on facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) + # Light yellow background for "Over 400 gallons threshold"
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::number(mean_days_bet_ord, accuracy = 0.1)),
            position = position_dodge(width = 0.8), vjust = -0.5, size = 3.5) +
  facet_wrap(~ fleet_designation, scales = "fixed") + # Fixed scale for both facets
  labs(title = "Comparison of Mean Days Between Orders by Fleet Type Designation",
       x = "Customer Type") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3"),
                    name = "Fleet Type") +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.title.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, color = "black"),
    axis.title.x = element_blank(),
    legend.position = "bottom",
    legend.title = element_text(face = "bold"),
    panel.grid.major.x = element_blank(),
    strip.text = element_text(size = 11, face = "bold"),
    strip.background = element_rect(fill = "white", color = NA),
    panel.spacing = unit(1, "lines")
  )
The red trucks should be optimized to serve the “Others” group, whose average order interval is 14 days, compared with 23 days under the 400-gallon threshold model. For the “Local Fountain Only” group, the difference relative to the white trucks would be approximately 2 days.
The white trucks, on the other hand, would serve more sporadic customers, with an average interval of over 260 days between orders.
10. Recommendation Impacts
10.1 Impact on Costs
The cost impact of using red trucks is significantly higher compared to white trucks. For OPEX, the delivery cost for red trucks is approximately 700% more than for white trucks when considering only variable costs.
The calculated cost of the total volume delivered to each customer via red trucks is stored in the column TOTAL_COST_CA_GAL. To keep the estimate conservative, a 400% difference is assumed: the red truck cost is divided by 5 to estimate the white truck cost, stored in ARTM_TOTAL_COST.
Below is the cost comparison.
Code
# Estimate the white truck (ARTM) cost as one fifth of the red truck cost
full_data_customer <- full_data_customer %>%
  mutate(ARTM_TOTAL_COST = TOTAL_COST_CA_GAL / 5)

# Creating the summary data with both fleet designations for delivery cost analysis
summary_delivery_cost_comparison <- full_data_customer %>%
  # Create a column that has the appropriate cost based on fleet type
  mutate(delivery_cost = case_when(
    FLEET_TYPE == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    FLEET_TYPE == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ NA_real_
  )) %>%
  # Reshape the data into a longer format with both fleet designations
  pivot_longer(
    cols = c(NEW_FLEET, FLEET_TYPE),
    names_to = "fleet_designation",
    values_to = "fleet_value"
  ) %>%
  # Update the delivery cost calculation for each designation
  mutate(delivery_cost = case_when(
    fleet_designation == "NEW_FLEET" & fleet_value == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation == "NEW_FLEET" & fleet_value == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    fleet_designation == "FLEET_TYPE" & fleet_value == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation == "FLEET_TYPE" & fleet_value == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ delivery_cost
  )) %>%
  # Group by fleet designation, value, and LOCAL_FOUNT_ONLY
  group_by(fleet_designation, fleet_value, LOCAL_FOUNT_ONLY) %>%
  # Sum delivery costs in each group
  summarise(total_delivery_cost = sum(delivery_cost, na.rm = TRUE)) %>%
  ungroup() %>%
  # Convert to factors for proper ordering
  mutate(
    LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
    fleet_designation = factor(fleet_designation,
                               levels = c("FLEET_TYPE", "NEW_FLEET"),
                               labels = c("Over 400 gallons threshold", "Recommended Fleet Type"))
  )

# Create the faceted chart
ggplot(summary_delivery_cost_comparison, aes(x = LOCAL_FOUNT_ONLY, y = total_delivery_cost, fill = fleet_value)) +
  # Add background based on facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) + # Light yellow background for "Over 400 gallons threshold"
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::dollar(total_delivery_cost, accuracy = 1)),
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  facet_wrap(~ fleet_designation, scales = "free_y") + # Allow y-axis to vary between facets if necessary
  labs(title = "Comparison of Delivery Cost by Fleet Type Designation",
       y = "Total Delivery Cost ($ Millions)") +
  # Set colors for the fleet types
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3"),
                    name = "Fleet Type") +
  # Set x-axis labels
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  # Scale y-axis to display values in millions
  scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 9),
    axis.title.y = element_text(size = 10),
    axis.text.x = element_text(size = 10, angle = 0),
    axis.title.x = element_blank(),
    legend.position = "bottom",
    panel.grid.major.x = element_blank(),
    strip.text = element_text(size = 11, face = "bold")
  )
Relative to the 400-gallon threshold, the estimated differences over a two-year period would be:
Others – Red Truck: cost reduction of $803,612 (2%);
Others – White Truck: cost increase of $160,722 (3%);
Local Fountain Only – Red Truck: cost reduction of $161,079 (34%);
Local Fountain Only – White Truck: cost increase of $32,216 (22%).
The total cost using the 400-gallon threshold over two years would be $46,462,823, while the recommendation totals $45,691,071. The net result over these two years would be a total savings of $771,752, representing a 1.7% reduction compared to the original 400-gallon threshold strategy.
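The savings figure can be verified with simple arithmetic from the two totals reported above:

```r
# Two-year total delivery costs under each strategy
cost_threshold      <- 46462823 # 400-gallon threshold
cost_recommendation <- 45691071 # recommended criteria

savings <- cost_threshold - cost_recommendation
savings_pct <- savings / cost_threshold * 100

savings               # 771752
round(savings_pct, 1) # 1.7
```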
These values were calculated based on actual historical delivery volumes. Predicting whether these savings will continue in the future is highly uncertain due to many potential influencing factors—such as economic shifts, customer reactions, competitor strategies, and more. Additionally, the limited historical data (only two years) adds uncertainty to future projections.
10.2 Impact on Fleet Assignment by Cold Drink Channel
With the recommendation, the dining segment would see a 7% reduction in the number of customers previously served by red trucks, who would now be served by white trucks.
Events and Public Sector would experience a reduction of nearly 5% in customers served by red trucks. The remaining segments would see changes of less than 2%.
The conventional segment is not displayed due to its low volume, but its change was also less than 2%.
10.3 Impact on Order Types
Code
# Merge new fleet on full_data
full_data <- full_data %>%
  left_join(full_data_customer %>% dplyr::select(CUSTOMER_NUMBER, NEW_FLEET),
            by = "CUSTOMER_NUMBER")

# Summarize by ORDER_TYPE and NEW_FLEET using delivered volume
data_summary_fleet_by_order <- full_data %>%
  filter(!is.na(NEW_FLEET), !is.na(ORDER_TYPE)) %>%
  group_by(ORDER_TYPE, NEW_FLEET) %>%
  summarise(TotalDelivered = sum(DELIVERED_CASES + DELIVERED_GALLONS, na.rm = TRUE),
            .groups = "drop") %>%
  group_by(ORDER_TYPE) %>%
  mutate(Percentage = round(TotalDelivered / sum(TotalDelivered) * 100, 0))

# Order ORDER_TYPE by total delivered volume
order_levels <- data_summary_fleet_by_order %>%
  group_by(ORDER_TYPE) %>%
  summarise(Total = sum(TotalDelivered), .groups = "drop") %>%
  arrange(Total) %>%
  pull(ORDER_TYPE)

# Reorder as factor
data_summary_fleet_by_order$ORDER_TYPE <- factor(data_summary_fleet_by_order$ORDER_TYPE,
                                                 levels = order_levels)

# Plot
ggplot(data_summary_fleet_by_order,
       aes(x = TotalDelivered, y = ORDER_TYPE, fill = NEW_FLEET)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.6) +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5), hjust = 0, color = "black", size = 3.2) +
  labs(title = "Our Recommendation - Delivered Volume by Order Type",
       x = "Volume (units)", y = NULL, fill = "New Fleet Type") +
  scale_x_continuous(labels = function(x) paste0(x / 1e6, "M"),
                     breaks = c(2500000, 5000000, 7500000, 10000000),
                     expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_text(size = 10, face = "plain"),
        legend.position = "right",
        legend.direction = "vertical",
        panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_line(color = "lightgray", size = 0.5),
        panel.grid.minor = element_blank())
The key takeaway here is that the volume served by sales reps would see only a slight reduction of about 2 percent compared to the 400-gallon threshold. This helps avoid abrupt changes that could harm relationships with customers who have closer contact with our sales team.
The most significant shift, however, would occur with orders placed through the call center. Approximately 20 percent of the volume that would have been served by red trucks under the 400-gallon threshold would now be served by white trucks. This allows red trucks to be redirected to other types of orders with greater potential to strengthen customer relationships.
10.4 Customers Impacted
All Customers
Code
# Create WHITE_TO_RED:
# Assign 0 if both FLEET_TYPE and NEW_FLEET are "WHITE TRUCK", otherwise assign 1
full_data_customer$WHITE_TO_RED <- ifelse(
  full_data_customer$FLEET_TYPE == "WHITE TRUCK" &
    full_data_customer$NEW_FLEET == "WHITE TRUCK", 0, 1)

# Create RED_TO_WHITE:
# Assign 0 if both FLEET_TYPE and NEW_FLEET are "RED TRUCK", otherwise assign 1
full_data_customer$RED_TO_WHITE <- ifelse(
  full_data_customer$FLEET_TYPE == "RED TRUCK" &
    full_data_customer$NEW_FLEET == "RED TRUCK", 0, 1)

full_data_customer$CHANGED_FLEET <- ifelse(
  full_data_customer$FLEET_TYPE != full_data_customer$NEW_FLEET, "Yes", "No")

# Create fleet transition categories
fleet_change_summary <- full_data_customer %>%
  mutate(FLEET_STATUS = case_when(
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
    TRUE ~ "Other")) %>%
  filter(FLEET_STATUS != "Other") %>%
  mutate(FLEET_STATUS = factor(FLEET_STATUS,
                               levels = c("Stayed White", "Stayed Red", "Changed Fleet"))) %>%
  group_by(FLEET_STATUS) %>%
  summarise(Num_Customers = n(), .groups = "drop") %>%
  mutate(Percentage = Num_Customers / sum(Num_Customers) * 100,
         Label = paste0(Num_Customers, " (", round(Percentage, 1), "%)"))

# Custom colors
fleet_colors <- c("Stayed Red" = "#B33951",
                  "Stayed White" = "#D3D3D3",
                  "Changed Fleet" = "plum")

# Plot with value and percentage labels
ggplot(fleet_change_summary, aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +
  geom_col(show.legend = FALSE, width = 0.45) +
  geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = fleet_colors) +
  labs(title = "Number of Customers by Fleet Type (400 gal X New Recommendation)",
       x = "", y = "Number of Customers") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(face = "bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
Among all customers, compared to the 400-gallon threshold, 14 percent (4,235) would have their fleet assignment changed, either from red truck to white truck or vice versa.
These 4,235 customers represent 9.3% of the total volume sold in 2023 and 2024. Of these, 2,461 would switch from white trucks to red trucks (20% of the white truck volume), while 1,774 would switch from red trucks to white trucks (7.4% of the red truck volume).
Local Market Partners - Local Fountain Only
Code
# Create fleet transition categories for LOCAL_FOUNT_ONLY customers
fleet_change_summary <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%  # Filter only LOCAL_FOUNT_ONLY customers
  mutate(FLEET_STATUS = case_when(
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
    TRUE ~ "Other")) %>%
  filter(FLEET_STATUS != "Other") %>%
  mutate(FLEET_STATUS = factor(FLEET_STATUS,
                               levels = c("Stayed White", "Stayed Red", "Changed Fleet"))) %>%
  group_by(FLEET_STATUS) %>%
  summarise(Num_Customers = n(), .groups = "drop") %>%
  mutate(Percentage = Num_Customers / sum(Num_Customers) * 100,
         Label = paste0(Num_Customers, " (", round(Percentage, 1), "%)"))

# Custom colors
fleet_colors <- c("Stayed Red" = "#B33951",
                  "Stayed White" = "#D3D3D3",
                  "Changed Fleet" = "plum")

# Plot with value and percentage labels
ggplot(fleet_change_summary, aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +
  geom_col(show.legend = FALSE, width = 0.45) +
  geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = fleet_colors) +
  labs(title = "LFO Number of Customers (400 gal X New Recommendation)",
       x = "", y = "Number of Customers") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(face = "bold"),
        panel.grid.major = element_blank(),  # Remove major grid lines
        panel.grid.minor = element_blank())  # Remove minor grid lines
Among local market partners, 148 customers (11%) would switch fleets, accounting for 25% of the group's total volume. Of these, 95 would switch from red trucks to white trucks, representing 52% of the red truck customers and 37% of the red truck volume in this group.
Additionally, 53 customers would switch from white trucks to red trucks, representing 4.5% of the white truck customers and 9% of the white truck volume in this group.
Impacts on Chain Members
Code
# Define levels to control the bar order
fleet_status_levels <- c("Stayed Red", "Red to White", "Stayed White", "White to Red")

# Prepare the data with CHAIN_MEMBER included
fleet_change_data <- full_data_customer %>%
  mutate(FLEET_ORIGIN = case_when(
    FLEET_TYPE == "RED TRUCK" ~ "RED_TO_WHITE",
    FLEET_TYPE == "WHITE TRUCK" ~ "WHITE_TO_RED",
    TRUE ~ "Other"),
    FLEET_STATUS = case_when(
      FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Red to White",
      FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
      FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "White to Red",
      FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
      TRUE ~ "Other"),
    FLEET_STATUS = factor(FLEET_STATUS, levels = fleet_status_levels),
    CHAIN_MEMBER = as.factor(CHAIN_MEMBER))

# Summarize with CHAIN_MEMBER
fleet_change_summary <- fleet_change_data %>%
  filter(FLEET_ORIGIN %in% c("RED_TO_WHITE", "WHITE_TO_RED")) %>%
  group_by(FLEET_ORIGIN, CHAIN_MEMBER, FLEET_STATUS) %>%
  summarise(Num_Customers = n(), .groups = "drop") %>%
  group_by(FLEET_ORIGIN, CHAIN_MEMBER) %>%
  mutate(Percentage = Num_Customers / sum(Num_Customers) * 100,
         Label = paste0(Num_Customers, " (", round(Percentage, 1), "%)"))

# Define custom colors
fleet_colors <- c("Red to White" = "#D3D3D3",
                  "White to Red" = "#B33951",
                  "Stayed Red" = "#B33951",
                  "Stayed White" = "#D3D3D3")

# Plot function
plot_fleet_change <- function(origin) {
  ggplot(fleet_change_summary %>% filter(FLEET_ORIGIN == origin),
         aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +
    geom_col(show.legend = FALSE, width = 0.45) +
    geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
    scale_fill_manual(values = fleet_colors) +
    facet_wrap(~ CHAIN_MEMBER, labeller = label_both) +
    scale_y_continuous(limits = c(0, 15000)) +  # Y scale set to 0-15000
    labs(title = paste("Fleet Transition:", gsub("_", " ", origin)),
         x = "", y = "Number of Customers") +
    theme_minimal() +
    theme(axis.text.x = element_text(size = 10),
          plot.title = element_text(face = "bold"),
          panel.grid.major.x = element_blank(),  # Remove vertical grid lines
          panel.grid.minor.x = element_blank())
}

# Plots
plot_fleet_change("RED_TO_WHITE")
Code
plot_fleet_change("WHITE_TO_RED")
Among customers who are chain members (CHAIN_MEMBER = 1) and who, based on the 400-gallon threshold, should be served by red trucks, 17% would now be served by white trucks instead. This shift raises the question of whether there could be a negative impact due to the inconsistent service model within the same customer group.
In parallel, 16% of customers who should be served by white trucks under the same threshold would now be served by red trucks. This inversion in fleet assignment suggests a possible misalignment with the intended operational segmentation, and should be further evaluated to ensure customer experience and operational efficiency are not compromised.
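One way to quantify the inconsistency risk raised above is to flag chains whose outlets would end up on different fleets. The base-R sketch below runs on a toy stand-in for `full_data_customer`; it assumes the table carries `PRIMARY_GROUP_NUMBER` from the profile data (0 for non-members) alongside `NEW_FLEET`:

```r
# Toy stand-in for full_data_customer (illustrative values only)
full_data_customer <- data.frame(
  PRIMARY_GROUP_NUMBER = c(10, 10, 10, 20, 20, 0),
  NEW_FLEET = c("RED TRUCK", "RED TRUCK", "WHITE TRUCK",
                "RED TRUCK", "RED TRUCK", "WHITE TRUCK"))

# For each chain (PRIMARY_GROUP_NUMBER != 0), count distinct fleets assigned
chains <- subset(full_data_customer, PRIMARY_GROUP_NUMBER != 0)
fleets_per_chain <- tapply(chains$NEW_FLEET, chains$PRIMARY_GROUP_NUMBER,
                           function(x) length(unique(x)))

# Chains with a mixed (inconsistent) service model
mixed_chains <- names(fleets_per_chain[fleets_per_chain > 1])
mixed_chains  # here: chain 10 would be split between red and white trucks
```

Applied to the real data, the share of chains in `mixed_chains` would give a direct measure of how widespread the mixed service model would be.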
10.5 Impact on Customer Segments (clusters)
Below is a visualization, by cluster, of the customers who would change fleet assignment, plotted by their consumption and number of orders.
Code
# Define custom colors for the fleet and clusters
palette_fleet <- c("RED TRUCK" = "#B33951",    # Red truck
                   "WHITE TRUCK" = "#D3D3D3")  # White truck

palette_clusters <- c("Cluster 1" = "#FF6347",  # Coral
                      "Cluster 2" = "#4682B4",  # Steel blue
                      "Cluster 3" = "#FFD700")  # Gold

# Create a data frame for the threshold line
threshold_line <- data.frame(x = c(1, 500),
                             y = c(400, 400),
                             type = "400 Gallons Threshold")

# Filter only customers with CHANGED_FLEET == "Yes"
changed_fleet_data <- full_data_customer %>%
  filter(CHANGED_FLEET == "Yes") %>%
  filter(NEW_FLEET %in% c("RED TRUCK", "WHITE TRUCK")) %>%
  mutate(Fleet_Type = NEW_FLEET)

# Define a custom labeller for the clusters
custom_labeller <- labeller(CLUSTER = c("1" = "Cluster 1",
                                        "2" = "Cluster 2",
                                        "3" = "Cluster 3"))

# Create scatter plot with log scales and no background color for facet labels
ggplot(changed_fleet_data) +
  geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = Fleet_Type),
              alpha = 0.5, width = 0.2) +  # Jitter to avoid overplotting
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +  # Log scale for x-axis with specific limits
  scale_y_log10(limits = c(10, 1000000),
                breaks = c(10, 100, 1000, 10000, 100000, 1000000),
                labels = scales::comma) +
  scale_color_manual(values = palette_fleet) +
  scale_linetype_manual(values = "solid", name = "") +
  labs(title = "Customers who changed truck assignments by Cluster",
       x = "Number of Orders (Log Scale)",
       y = "Average Annual Consumption (Log Scale)") +
  facet_wrap(~ CLUSTER + Fleet_Type, scales = "fixed", nrow = 1, labeller = custom_labeller) +
  theme_minimal() +
  theme(text = element_text(size = 12),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        strip.text = element_text(size = 9),   # Adjust text size for facet labels
        strip.background = element_blank(),    # Remove background color from facet labels
        panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
        panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white", color = "white"),
        legend.position = "bottom",
        legend.box = "vertical") +
  guides(color = "none")  # Remove legend for Fleet_Type
Out of the 4,235 customers who would change their fleet assignment:
These customers represent 9.3% of the total volume.
Cluster Breakdown:
1,273 customers from Cluster 1 would now be served by red trucks.
1,188 customers from Cluster 2 would switch from white trucks to red trucks, reflecting high potential, recency, and order frequency.
1,748 customers from Cluster 2 would switch from red trucks to white trucks.
26 customers from Cluster 3 would switch from red trucks to white trucks.
Customer Segmentation and Cold Drink Channel
Among the customers who would change fleet assignments, the majority belong to the Dining segment (52%), where 962 would switch from red trucks to white trucks, and 1,229 would switch from white trucks to red trucks.
The second-largest segment with changes is GOODS (19%), where 251 customers would switch from red trucks to white trucks, and 549 would switch from white trucks to red trucks.
The EVENT segment (9%) would have 226 customers switching from red trucks to white trucks, while 159 would switch from white trucks to red trucks.
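The segment shares quoted above follow directly from the switch counts; a quick check using the figures reported in this section (counts hardcoded from the text):

```r
# Customers changing fleet, by Cold Drink Channel segment (from the text above)
changes <- c(DINING = 962 + 1229,  # red-to-white + white-to-red
             GOODS  = 251 + 549,
             EVENT  = 226 + 159)
total_changed <- 4235  # all customers changing fleet assignment

round(changes / total_changed * 100)
# DINING  GOODS  EVENT
#     52     19      9
```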
Code
# Create a data frame for the threshold line
threshold_line <- data.frame(x = c(1, 500),
                             y = c(400, 400),
                             type = "400 Gallons Threshold")

# Define colors for NEW_FLEET
fleet_colors <- c("RED TRUCK" = "#B33951",
                  "WHITE TRUCK" = "#D3D3D3")

# Filter data for changed fleet
filtered_data <- full_data_customer %>%
  filter(CHANGED_FLEET == "Yes")

# Calculate the number of unique customers
unique_customers <- filtered_data %>%
  summarise(unique_customers = n_distinct(CUSTOMER_NUMBER))
# print(unique_customers)

# Create scatter plot colored by NEW_FLEET
ggplot(filtered_data) +
  geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = NEW_FLEET),
              alpha = 0.5, width = 0.2) +
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +
  scale_y_log10(limits = c(10, 100000),
                breaks = c(10, 100, 1000, 10000, 100000),
                labels = scales::comma) +
  scale_color_manual(values = fleet_colors) +
  scale_linetype_manual(values = "solid", name = NULL) +
  labs(title = "Customers Who Changed Fleet Assignment",
       subtitle = "Average Annual Consumption vs. Number of Orders by Cold Drink Channel",
       x = "Number of Orders (Log Scale)",
       y = "Average Annual Consumption (Log Scale)") +
  facet_wrap(~ COLD_DRINK_CHANNEL, scales = "fixed") +
  theme_minimal() +
  theme(text = element_text(size = 12),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        strip.text = element_text(size = 10),
        panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
        panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white", color = "white"),
        legend.position = "bottom")
Code
# Define colors based on NEW_FLEET
fleet_colors <- c("RED TRUCK" = "#B33951",    # Red
                  "WHITE TRUCK" = "#D3D3D3")  # Light gray

# Filter the data for LOCAL_FOUNT_ONLY == 1 and CHANGED_FLEET == "Yes"
filtered_local_fount <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1, CHANGED_FLEET == "Yes")

# Threshold line for 400 gallons
threshold_line <- data.frame(x = c(1, 500),
                             y = c(400, 400),
                             type = "400 Gallons Threshold")

# Updated plot
ggplot(filtered_local_fount) +
  geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = NEW_FLEET),
              alpha = 0.6, width = 0.2) +
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +
  scale_y_log10(limits = c(10, 100000),
                breaks = c(10, 100, 1000, 10000, 100000),
                labels = scales::comma) +
  scale_color_manual(values = fleet_colors) +
  scale_linetype_manual(values = "solid", name = NULL) +
  labs(title = "Local Fountain Only Customers Who Changed Fleet Assignment",
       subtitle = "Average Annual Consumption vs. Number of Orders by Cold Drink Channel",
       x = "Number of Orders (Log Scale)",
       y = "Average Annual Consumption (Log Scale)",
       color = "New Fleet") +
  facet_wrap(~ COLD_DRINK_CHANNEL, scales = "fixed") +
  theme_minimal() +
  theme(text = element_text(size = 12),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        strip.text = element_text(size = 10),
        panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
        panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
        panel.grid.minor = element_blank(),
        panel.background = element_rect(fill = "white", color = "white"),
        legend.position = "bottom")
Code
# List all variables in the environment
all_vars <- ls()

# Exclude 'full_data', 'full_data_customer', and a few helper variables from removal
vars_to_keep <- c("full_data", "full_data_customer",
                  "mydir", "one_seed", "reference_date")

# Get the variables to remove
vars_to_remove <- setdiff(all_vars, vars_to_keep)

# Remove the temporary data frames
rm(list = vars_to_remove)

# Clean up by removing 'all_vars' and 'vars_to_remove'
rm(all_vars, vars_to_remove)
Among the local market partners with fountain drink only, nearly 90% of the fleet changes would occur in the Dining segment. In this group, 85 customers would switch from white trucks to red trucks, and 45 would switch from red trucks to white trucks.
11. Business Value and Final Conclusions
The proposed fleet reassignment strategy would have saved the company approximately $770,000 over the two-year period analyzed, by increasing the number of customers served by red trucks, optimizing their usage frequency, and reducing their delivered volume by 3%, which would allow their eventual redeployment to strategic customers.
The proposal was deliberately conservative, redesigning the delivery method for only 14% of customers, and assigned the fleet based not only on volume but also on several intrinsic customer characteristics. The expectation is therefore that implementation will bring gains not only in cost reduction but also in sales, mainly for customers with greater growth potential. In addition, the analysis identified three main customer groups, two of which showed good homogeneity.
When measuring the impacts of the new fleet assignment, the Dining segment was the most affected, particularly among the local market partners classified as fountain only. There was no significant impact on the activities of sales representatives, but there was a significant reduction (-20%) in the volume delivered by red trucks for orders placed through call centers. This is actually a positive outcome, since call-center orders no longer involved a strong relationship with customers.
A differentiator in the process was the feature engineering, which brought robustness to the clustering. The supervised models, a Decision Tree and Multinomial Logistic Regression, were important in explaining the variables that influenced the clusters and, with accuracy close to 90%, have the potential to predict segments for new customers.
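Scoring a new customer with the fitted multinomial model would look like the sketch below. This is a minimal illustration on synthetic data, not the report's actual model: the seed follows the report's `one_seed = 500` convention, the two predictors (`AVG_ANNUAL_CONSUMP`, `NUM_ORDERS`) are real variables from the analysis, but the training values and cluster separation are invented for the example.

```r
library(nnet)  # multinom(), a recommended package shipped with R

set.seed(500)  # matches the one_seed convention used in this report

# Synthetic stand-in for the clustered customer features (illustrative only)
train <- data.frame(
  CLUSTER = factor(rep(1:3, each = 50)),
  AVG_ANNUAL_CONSUMP = c(rnorm(50, 100, 10), rnorm(50, 500, 50), rnorm(50, 2000, 200)),
  NUM_ORDERS = c(rpois(50, 5), rpois(50, 20), rpois(50, 60)))

# Multinomial logistic regression explaining cluster membership
fit <- multinom(CLUSTER ~ AVG_ANNUAL_CONSUMP + NUM_ORDERS, data = train, trace = FALSE)

# Score a new customer: predicted segment and class probabilities
new_customer <- data.frame(AVG_ANNUAL_CONSUMP = 450, NUM_ORDERS = 25)
predict(fit, newdata = new_customer)                  # most likely cluster
predict(fit, newdata = new_customer, type = "probs")  # probability per cluster
```

In production the same `predict()` call would be applied to each onboarded customer's real feature vector, assigning the segment with the highest probability.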
Limitations, Improvements, and Lessons
One of the main limitations of this project was the short, two-year historical window, which made it difficult to predict the future impact of the recommendation. Forecasting was challenging because the prediction intervals were wide, meaning almost any outcome was plausible.
Another challenge was the asynchrony between customer orders, which made it hard to track individual customer growth tied to specific times of the year. With a longer historical series, we could have made more accurate future predictions.
The census data could have been better utilized. The way it was applied in this project didn’t deliver the expected results, but with adjustments and more historical data, it could provide valuable insights for future analysis.
It’s clear that predicting future growth, even with extensive data, is a complex task. These predictions should only be emphasized if the process is robust, with strong statistical support and a consistent range of possible outcomes. Otherwise, it might be better to refrain from highlighting them.
Looking ahead, I strongly recommend conducting further tests to measure the impact of fleet allocation and the way customers place orders. This will be crucial in validating or refining the current approach. Additionally, analyzing revenue could provide deeper business insights, especially in understanding margins across different customer segments.
A key takeaway from this project is that data doesn’t always provide all the answers we need for decision-making. In these cases, history shows that there will be both successes and setbacks, but decisions still need to be made. My role was to make responsible recommendations and take a clear stance, even when faced with uncertainties.
Source Code
---title: "Delivery Standardization - MSBA Capstone 2025"author: "Kleyton R. Polzonoff"date: todayformat: html: toc: true toc_float: true # Enable floating TOC with automatic highlighting self-contained: true # HTML only, no folders code-tools: true code-fold: TRUE code-summary: "Show the code"execute: echo: true eval: true message: false warning: falseeditor: markdown: wrap: sentence---```{r setup, include=FALSE}library(rmarkdown); library(tidyverse); library(knitr); library(matrixStats); library(ROSE); library(DMwR2); library(randomForest); library(C50); library(clusterSim); library(fpc); library(fossil); library(cvTools); library(cluster); library(flexclust); library(mclust); library(viridis); library(ggplot2); library(rpart); library(RWeka); library(kernlab); library(MLmetrics); library(smotefamily); library(caret); library(rminer); library(tictoc); library(rpart.plot); library(psych); library(ROCR); library(glmnet); library(car); library(reshape2); library(ggcorrplot); library(dplyr); library(tidyr); library(grf); library(kableExtra); library(formattable); library(rattle); library(factoextra); library(corrplot); library(dataPreparation); library(scales); library(DT); library(lubridate); library(data.table); library(glue); library(readxl); library(janitor); library(RColorBrewer); library(tidycensus); library(doParallel); library(vcd); library(scales); library(nnet);library(gt)#library(sf)one_seed<-500# Suppress warnings globallyoptions(warn =-1)# Reference Datereference_date <-as.Date("2025-01-01")mydir <-getwd()setwd(mydir)```**Important Note** To display the code, click the **"Code"** button in the body of the document or click the **\</\> Code** button at the top right, then select **"Show All Code."**------------------## 1. 
Business Problem Statement and ObjectivesThe client, a major beverage supplier, needs a structured system to optimize logistics between its own fleet of Red Trucks and alternative delivery methods (ARTM), which include partner trucks and third-party carriers known as White Trucks. Red Trucks enhance customer relationships and contribute to revenue, while ARTM offers flexibility but limits interaction and control.To ensure high-quality service and cost efficiency, I will establish clear fleet allocation guidelines based on customer profiles, transaction data, addresses, and delivery costs. This approach will determine the optimal truck type for each customer using a well-defined annual volume threshold. Additionally, customer segmentation will identify shared characteristics, enabling more strategic and data-driven decision-making.Based on these insights, I will provide actionable recommendations to optimize fleet allocation and enhance operational efficiency.## 2. Analytical Approach and DeliveriesThe analysis will be conducted separately for two customer groups:- **All Customers** – The broader customer base, including those who purchase various product types.- **Local Market Partners Buying Fountain Only** – Customers who purchase only fountain drinks, excluding CO2, cans, or bottles.To address logistics challenges and transform decisions into data-driven solutions, the approach will combine predictive models with clustering techniques, using both supervised and unsupervised learning methods to build a structured and efficient logistics framework.### Supervised LearningSupervised learning techniques will be applied to determine whether each customer should be served by Red Trucks (our own fleet) or White Trucks (ARTM), based on a defined set of criteria.**Refining the Fleet Allocation Strategy: Initial Assumptions**As the dataset does not provide predefined fleet allocation criteria, I will establish initial reference points to guide this analysis:- Annual Volume 
Threshold: Customers receiving 400 cases and/or gallons per year will initially be assigned to the Red Truck fleet, while those below this threshold will be served by White Trucks.### Unsupervised Learning - Customer SegmentationA clustering analysis will identify customer groups with similar consumption patterns, refining fleet allocation and enhancing decision-making rules.### Cost Impact AnalysisConsidering that the delivery cost of white trucks is five times lower than that of red trucks, different logistics scenarios will be analyzed to compare the costs and strategic impacts of the current approach with the final recommended strategy.### RecommendationsBased on the analysis, data-driven recommendations will be provided to optimize fleet allocation, with a focus on improving service quality, cost efficiency, and strategic decision-making. This approach ensures that each customer receives the most suitable delivery method.### Description of the DataThis project will use four data files provided by the company: 1. customer_profile.csv 2. transactional_data.csv 3. customer_address_and_zip_mapping.csv 4. delivery_cost_data.xlsx- The customer profile data includes information on all customers they deliver to. This file contains each customer's unique ID along with various categorical variables that describe their location, industry, and delivery preferences.- The transactional data contains all transactions from all customers with the ordered and delivered amount of product measured in cases and gallons.- The customer address file only contains two columns – zip code and full address. This can be used in tandem with the customer profile data.- The delivery cost data maps the cost of delivering a product based on different criteria. 
This will be used with the transaction data to find the cost of each transaction.## 3 Exploratory Data Analysis (EDA) - Part IThis section analyzes the provided data to identify solutions, with a focus on completeness, consistency, and potential issues. Data transformations may include the creation of new variables to improve model accuracy. Given the large number of variables, the most relevant ones will be prioritized to ensure clarity, while less relevant analyses will be excluded to avoid information overload.### 3.1 Loading and Cleaning Datasets```{r}# Load the profile CSVprofile_data <-read.csv(file ="customer_profile.csv", sep =",", stringsAsFactors =FALSE)# Load the address CSVcustomer_address <-read.csv(file ="customer_address_and_zip_mapping.csv", sep =",", stringsAsFactors =FALSE)# Load the transactional CSVop_data <-read.csv(file ="transactional_data.csv", sep =",", stringsAsFactors =FALSE)```Missing data assessments and any substitutions or modifications will be carried out and will be included in the provided R Markdown file. However, some of these actions will not be displayed in this report to avoid content overload.**Profile Dataset - Cleaning and Adjustments**```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}# 1. 
Customer Profile Dataset cleaning and basic transformations# Data Structurestr(profile_data)# Check the number of unique CUSTOMER_NUMBER in profile_data and op_dataprint(length(unique(profile_data$CUSTOMER_NUMBER)))print(length(unique(op_data$CUSTOMER_NUMBER)))# Convert the date columns to Date formatprofile_data$FIRST_DELIVERY_DATE <-as.Date(profile_data$FIRST_DELIVERY_DATE, format ="%m/%d/%Y")profile_data$ON_BOARDING_DATE <-as.Date(profile_data$ON_BOARDING_DATE, format ="%m/%d/%Y")# Convert logical to integer for LOCAL_MARKET_PARTNERprofile_data$LOCAL_MARKET_PARTNER <-as.integer(profile_data$LOCAL_MARKET_PARTNER)# Convert logical to integer for CO2_CUSTOMERprofile_data$CO2_CUSTOMER <-as.integer(profile_data$CO2_CUSTOMER)# Convert all character columns to factorsprofile_data[sapply(profile_data, is.character)] <-lapply(profile_data[sapply(profile_data, is.character)], as.factor)# Summarysummary(profile_data)# Remove parentheses, any "-" or "&", then trim extra spacesprofile_data$COLD_DRINK_CHANNEL <-as.factor(gsub("\\s+", " ", gsub("[(]|[)]|[-]|[&]", " ", as.character(profile_data$COLD_DRINK_CHANNEL))))profile_data$TRADE_CHANNEL <-as.factor(gsub("\\s+", " ", gsub("[(]|[)]|[-]|[&]", " ", as.character(profile_data$TRADE_CHANNEL))))profile_data$SUB_TRADE_CHANNEL <-as.factor(gsub("\\s+", " ", gsub("[(]|[)]|[-]|[&]", " ", as.character(profile_data$SUB_TRADE_CHANNEL))))# Verify the changes in the summarysummary(profile_data)# Display the count of NA values for each variablecat("NA count for each variable:\n")sapply(profile_data, function(x) sum(is.na(x)))# Display the count of NULL values for each variablecat("NULL count for each variable:\n")sapply(profile_data, function(x) sum(is.null(x)))# Replace NA values in PRIMARY_GROUP_NUMBER with zeroprofile_data$PRIMARY_GROUP_NUMBER[is.na(profile_data$PRIMARY_GROUP_NUMBER)] <-0# Display the count of duplicate 'CUSTOMER_NUMBER' valuescat("Number of duplicate CUSTOMER_NUMBER values:", 
sum(duplicated(profile_data$CUSTOMER_NUMBER)), "\n")

# Create CHAIN_MEMBER column: 0 for non-members, 1 for members
profile_data$CHAIN_MEMBER <- as.integer(profile_data$PRIMARY_GROUP_NUMBER != 0)
```

- The number of unique CUSTOMER_NUMBER values in profile_data is greater than in the transactional data. This will be addressed later, before merging the datasets.
- There are no duplicates or missing values for CUSTOMER_NUMBER.
- Date variables were adjusted to the proper format.
- Logical variables were converted to integers, where 0 represents false and 1 represents true.
- Special characters and extra spaces were removed from factor variables.
- Missing values in the PRIMARY_GROUP_NUMBER field were replaced with zero.
- The CHAIN_MEMBER variable was created to indicate whether the outlet belongs to a chain (has a PRIMARY_GROUP_NUMBER). A value of 1 represents a member, and 0 a non-member.

**Customer Address Dataset - Cleaning and Adjustments**

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# 2. Customer Address Dataset cleaning and basic transformations

# Data Structure
str(customer_address)

# Summary
summary(customer_address)

# Display the count of NA values for each variable
cat("NA count for each variable:\n")
sapply(customer_address, function(x) sum(is.na(x)))

# Display the count of NULL values for each variable
cat("NULL count for each variable:\n")
sapply(customer_address, function(x) sum(is.null(x)))

# Count duplicate rows based on 'zip' and 'full.address'
cat("Number of duplicate rows based on 'zip':", sum(duplicated(customer_address$zip)), "\n")
cat("Number of duplicate rows based on 'full.address':", sum(duplicated(customer_address$full.address)), "\n")

# Split the 'full.address' column into separate components
separated_columns <- strsplit(customer_address$full.address, ",")

# Modify the existing customer_address data frame with the separated columns
# in uppercase and appropriate data types
customer_address$ZIP <- as.integer(customer_address$zip)
customer_address$CITY <- as.factor(toupper(sapply(separated_columns, `[`, 2)))
customer_address$STATE <- as.factor(toupper(sapply(separated_columns, `[`, 4)))
customer_address$COUNTY <- as.factor(toupper(sapply(separated_columns, `[`, 5)))
customer_address$REGION <- as.integer(toupper(sapply(separated_columns, `[`, 6)))
customer_address$LATITUDE <- as.numeric(sapply(separated_columns, `[`, 7))
customer_address$LONGITUDE <- as.numeric(sapply(separated_columns, `[`, 8))

# Remove the original full.address and zip columns
customer_address$full.address <- NULL
customer_address$zip <- NULL

# Remove separated_columns variable to clean up memory
rm(separated_columns)

# Find duplicate rows based on latitude and longitude
customer_address %>%
  filter(duplicated(across(c(LATITUDE, LONGITUDE))) |
           duplicated(across(c(LATITUDE, LONGITUDE)), fromLast = TRUE)) %>%
  arrange(LATITUDE)
# Example:
# 40574 Lexington KY Fayette 67 38.0283 -84.4715
# 40575 Lexington KY Fayette 67 38.0283 -84.4715
# 40576 Lexington KY Fayette 67 38.0283 -84.4715
```

- The address was split into new columns for each component.
- The dataset does not contain customers' actual addresses but will be used for data aggregation to support customer segmentation. It includes 145 rows with identical geographic coordinates; however, no ZIP codes are duplicated.

**Transactional Dataset - Cleaning and Adjustments**

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# 3. Transactional Dataset cleaning and basic transformations

# Data Structure
str(op_data)

# Transaction date format
op_data$TRANSACTION_DATE <- as.Date(op_data$TRANSACTION_DATE, format = "%m/%d/%Y")

# Convert the ORDER_TYPE column to a factor
op_data$ORDER_TYPE <- as.factor(op_data$ORDER_TYPE)

# Summary
summary(op_data)

# Display the count of NA values for each variable
cat("NA count for each variable:\n")
sapply(op_data, function(x) sum(is.na(x)))

# Order type - convert nulls to OTHER
op_data$ORDER_TYPE <- replace(op_data$ORDER_TYPE, op_data$ORDER_TYPE == "null", "OTHER")

# Days after the transaction, relative to the reference date
op_data$DAYS_AFTER <- as.Date(reference_date) - op_data$TRANSACTION_DATE

# Convert DAYS_AFTER to numeric
op_data$DAYS_AFTER <- as.numeric(op_data$DAYS_AFTER)

# Transactions with no values for cases and gallons
# Filter the rows where all the values are zero
qtd_check <- op_data %>%
  filter(ORDERED_CASES == 0 & LOADED_CASES == 0 & DELIVERED_CASES == 0 &
           ORDERED_GALLONS == 0 & LOADED_GALLONS == 0 & DELIVERED_GALLONS == 0)

# Display the filtered rows interactively with DT
datatable(qtd_check, options = list(pageLength = 10, autoWidth = TRUE))

# Remove rows where all case and gallon values are zero
op_data <- op_data %>%
  filter(!(ORDERED_CASES == 0 & LOADED_CASES == 0 & DELIVERED_CASES == 0 &
             ORDERED_GALLONS == 0 & LOADED_GALLONS == 0 & DELIVERED_GALLONS == 0))

# Negative Cases Deliveries
# Creating RETURNED_CASES column based on DELIVERED_CASES values
op_data$RETURNED_CASES <- ifelse(op_data$DELIVERED_CASES < 0, -op_data$DELIVERED_CASES, 0)

# Replacing negative values in DELIVERED_CASES with 0
op_data$DELIVERED_CASES[op_data$DELIVERED_CASES < 0] <- 0

# Negative Gallons Deliveries
# Creating RETURNED_GALLONS column based on DELIVERED_GALLONS values
op_data$RETURNED_GALLONS <- ifelse(op_data$DELIVERED_GALLONS < 0, -op_data$DELIVERED_GALLONS, 0)

# Replacing negative values in DELIVERED_GALLONS with 0
op_data$DELIVERED_GALLONS[op_data$DELIVERED_GALLONS < 0] <- 0

# Transactions with no values for delivered or returned cases and gallons
# Filter the rows where all the values are zero
qtd_check <- op_data %>%
  filter(DELIVERED_CASES == 0 & RETURNED_CASES == 0 &
           DELIVERED_GALLONS == 0 & RETURNED_GALLONS == 0)

# Display the filtered rows interactively with DT
datatable(qtd_check, options = list(pageLength = 10, autoWidth = TRUE))

# Classifying transactions based on delivery and return information
op_data <- op_data %>%
  mutate(DLV_TYPE = case_when(
    DELIVERED_CASES > 0 & DELIVERED_GALLONS == 0 ~ "CASES",        # Delivered cases, no gallons
    DELIVERED_GALLONS > 0 & DELIVERED_CASES == 0 ~ "GALLONS",      # Delivered gallons, no cases
    DELIVERED_CASES > 0 & DELIVERED_GALLONS > 0 ~ "BOTH",          # Delivered both cases and gallons
    RETURNED_CASES > 0 & RETURNED_GALLONS == 0 ~ "RETURN_CASES",   # Returned cases, no gallons
    RETURNED_GALLONS > 0 & RETURNED_CASES == 0 ~ "RETURN_GALLONS", # Returned gallons, no cases
    RETURNED_CASES > 0 & RETURNED_GALLONS > 0 ~ "RETURN_BOTH",     # Returned both cases and gallons
    TRUE ~ "ORDER_LOAD")) %>%                                      # Order and/or load transactions
  mutate(DLV_TYPE = factor(DLV_TYPE))  # Convert to factor

# Remove temporary variables and data frames
rm(qtd_check)

# Summary DLV_TYPE
summary(op_data$DLV_TYPE)
```

- 11,131 null values in the ORDER_TYPE column were replaced with "OTHER."
- The DAYS_AFTER column was added to track the number of days since each transaction, up to February 2, 2025.
- 483 rows with zero values in all ORDERED, LOADED, and DELIVERED CASES and GALLONS columns will be removed from the dataset.
- Negative values in DELIVERED_CASES and DELIVERED_GALLONS have been moved to new columns (RETURNED_CASES and RETURNED_GALLONS), and the original columns were set to zero.
- 30,965 transactions relate to an order and/or load but have no delivery or return data. These are classified as "ORDER_LOAD" in the DLV_TYPE column.

### 3.2 Combined Dataset Driven by Transactions

During the exploration, combining all available data was identified as the most effective approach for subsequent analyses. Two files were created: one preserving individual transactions and another compiling information by customer. Both will be used in the exploratory data analysis.

The profile data contains exactly 1,801 unique ZIP codes, which were merged with the same number of unique ZIP codes from the customer address dataset. Note that some ZIP codes share the same geographic coordinates, which reduces reliability in those cases.

As previously mentioned, the number of unique customer numbers in the profile data (now referred to as full data) is greater than in the transactions dataset. Only customers present in the transactions dataset were included in the merged data.

```{r, results='hide'}
# Merge customer_address with profile_data using ZIP_CODE
full_data <- profile_data %>%
  left_join(customer_address, by = c("ZIP_CODE" = "ZIP"))

# Check the number of unique CUSTOMER_NUMBER in full_data and op_data
print(length(unique(full_data$CUSTOMER_NUMBER)))
print(length(unique(op_data$CUSTOMER_NUMBER)))

# Keep only CUSTOMER_NUMBERs that are also in op_data, then merge with op_data
full_data <- full_data %>%
  filter(CUSTOMER_NUMBER %in% op_data$CUSTOMER_NUMBER) %>%
  left_join(op_data, by = "CUSTOMER_NUMBER")
```

Below are the first 5 rows and 6 columns of the combined dataset.

```{r}
# Display the first few rows of the combined dataset
head(full_data[, 1:6], 5)
```

The variable LOCAL_FOUNT_ONLY will be created to identify whether the transaction's customer belongs to the "Local Market Partners Buying Fountain Only" group, i.e., customers who purchase only fountain drinks, excluding CO2, cans, or bottles.
It will be assigned a value of 1 if the customer belongs to this group and 0 otherwise.

```{r}
# Aggregate total delivered cases and gallons per customer
customer_summary <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(TOTAL_DELIVERED_CASES = sum(DELIVERED_CASES),
            TOTAL_DELIVERED_GALLONS = sum(DELIVERED_GALLONS),
            LOCAL_MARKET_PARTNER = max(LOCAL_MARKET_PARTNER),
            CO2_CUSTOMER = max(CO2_CUSTOMER),
            .groups = "drop")

# Classify customers based on aggregated values
customer_summary <- customer_summary %>%
  mutate(LOCAL_FOUNT_ONLY = case_when(
    LOCAL_MARKET_PARTNER == 1 & CO2_CUSTOMER == 0 &
      TOTAL_DELIVERED_GALLONS > 0 & TOTAL_DELIVERED_CASES == 0 ~ 1L,
    TRUE ~ 0L))

# Merge back to the original data
full_data <- full_data %>%
  left_join(dplyr::select(customer_summary, CUSTOMER_NUMBER, LOCAL_FOUNT_ONLY),
            by = "CUSTOMER_NUMBER")

# Remove temporary variables and data frames
rm(customer_summary)
```

The code below creates a table for an initial overview of the customer types.

```{r}
# Aggregate data by LOCAL_FOUNT_ONLY
summary_data <- full_data %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(customers = n_distinct(CUSTOMER_NUMBER),                    # Count unique customers
            transactions = n(),                                        # Count transactions for this group
            qtd_cas = sum(DELIVERED_CASES),                            # Total delivered cases
            qtd_gal = sum(DELIVERED_GALLONS),                          # Total delivered gallons
            total_qtd = sum(DELIVERED_CASES) + sum(DELIVERED_GALLONS), # Total volume (cases + gallons)
            .groups = "drop") %>%
  mutate(pct_cust = customers / sum(customers) * 100,
         pct_trans = transactions / nrow(full_data) * 100,
         pct_qtd = total_qtd / sum(total_qtd) * 100,
         pct_gal = qtd_gal / sum(qtd_gal) * 100) %>%
  rename(LFO = LOCAL_FOUNT_ONLY) %>%  # Rename the column
  mutate(# Format numbers with comma separators, rounding before formatting
    customers = format(customers, big.mark = ",", scientific = FALSE),
    transactions = format(transactions, big.mark = ",", scientific = FALSE),
    qtd_cas = format(round(qtd_cas, 0), big.mark = ",", scientific = FALSE),
    qtd_gal = format(round(qtd_gal, 0), big.mark = ",", scientific = FALSE),
    total_qtd = format(round(total_qtd, 0), big.mark = ",", scientific = FALSE),
    pct_cust = round(pct_cust, 1),
    pct_trans = round(pct_trans, 1),
    pct_qtd = round(pct_qtd, 1),
    pct_gal = round(pct_gal, 1))

# Add the total row (LFO = "Total")
total_row <- summary_data %>%
  summarise(LFO = "Total",
            customers = sum(as.numeric(gsub(",", "", customers))),
            transactions = sum(as.numeric(gsub(",", "", transactions))),
            qtd_cas = sum(as.numeric(gsub(",", "", qtd_cas))),
            qtd_gal = sum(as.numeric(gsub(",", "", qtd_gal))),
            total_qtd = sum(as.numeric(gsub(",", "", total_qtd))),
            pct_cust = 100,
            pct_trans = 100,
            pct_qtd = 100,
            pct_gal = 100) %>%
  mutate(customers = format(customers, big.mark = ",", scientific = FALSE),
         transactions = format(transactions, big.mark = ",", scientific = FALSE),
         qtd_cas = format(round(qtd_cas, 0), big.mark = ",", scientific = FALSE),
         qtd_gal = format(round(qtd_gal, 0), big.mark = ",", scientific = FALSE),
         total_qtd = format(round(total_qtd, 0), big.mark = ",", scientific = FALSE),
         pct_cust = round(pct_cust, 1),
         pct_trans = round(pct_trans, 1),
         pct_qtd = round(pct_qtd, 1),
         pct_gal = round(pct_gal, 1))

# Convert LFO to character to ensure consistency
summary_data <- summary_data %>% mutate(LFO = as.character(LFO))
total_row <- total_row %>% mutate(LFO = as.character(LFO))

# Combine the summary data with the total row
combined_data <- bind_rows(summary_data, total_row)

# Reorder the columns
combined_data <- combined_data[, c("LFO", "customers", "pct_cust", "transactions", "pct_trans",
                                   "qtd_cas", "qtd_gal", "pct_gal", "total_qtd", "pct_qtd")]

# Create the combined table
combined_data %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = F, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:10, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "lightgray") %>%  # Light gray header
  add_header_above(c("Local Market Partners Fountain Only (LFO) - Delivery Quantities Overview" = 10)) %>%
  kable_paper("striped", full_width = F)

# Remove temporary variables and data frames
rm(summary_data, total_row, combined_data)
```

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# For LOCAL_FOUNT_ONLY == 1, calculate the sums of DELIVERED_CASES, RETURNED_CASES,
# ORDERED_CASES and the count of CO2_CUSTOMER
full_data %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%
  summarise(total_delivered_cases = sum(DELIVERED_CASES, na.rm = TRUE),
            total_returned_cases = sum(RETURNED_CASES, na.rm = TRUE),
            total_ordered_cases = sum(ORDERED_CASES, na.rm = TRUE),
            co2_customer_count = sum(CO2_CUSTOMER == 1, na.rm = TRUE))

# Filter rows where LOCAL_FOUNT_ONLY == 1 and ORDERED_CASES is greater than 0
full_data %>%
  filter(LOCAL_FOUNT_ONLY == 1, ORDERED_CASES > 0)
```

Only 4.5% of customers are Local Market Partners who do not purchase CO2 and buy only fountain drinks (LFO = 1), accounting for 3% of transactions. They consumed 5.9% of delivered gallons but represent just 1.9% of the total volume (cases + gallons).

This small group of 1,359 customers includes 83 transactions with positive ordered cases. The last such order was placed on December 19, 2024, which would have allowed time for case deliveries to appear in the transactions. Since this did not occur, these customers will remain classified as part of the LFO group: they consume fountain drinks (gallons) despite occasionally ordering cases.

### 3.3 Combined Dataset Driven by Outlets

The information from the combined transaction dataset (`full_data`) will now be merged by customer and named `full_data_customer`. The goal is to create a unique list of customers who have made transactions.
This file will contain a large number of columns and will be used for further analysis.

```{r}
# Creating the YEAR_MONTH column to identify the periods
full_data <- full_data %>%
  mutate(YEAR_MONTH = format(as.Date(TRANSACTION_DATE), "%Y_%m"))

# Function to count transactions by period
count_transactions <- function(df, value_column, prefix) {
  df %>%
    group_by(CUSTOMER_NUMBER, YEAR_MONTH) %>%
    summarise(value_count = sum(!!sym(value_column) > 0, na.rm = TRUE), .groups = "drop") %>%
    pivot_wider(names_from = YEAR_MONTH, values_from = value_count,
                names_prefix = prefix, values_fill = list(value_count = 0))
}

# Counting transactions for each metric
trans_ordered_cases <- count_transactions(full_data, "ORDERED_CASES", "TRANS_ORD_CA_")
trans_ordered_gallons <- count_transactions(full_data, "ORDERED_GALLONS", "TRANS_ORD_GAL_")
trans_delivered_cases <- count_transactions(full_data, "DELIVERED_CASES", "TRANS_DLV_CA_")
trans_delivered_gallons <- count_transactions(full_data, "DELIVERED_GALLONS", "TRANS_DLV_GAL_")
trans_returned_cases <- count_transactions(full_data, "RETURNED_CASES", "TRANS_RET_CA_")
trans_returned_gallons <- count_transactions(full_data, "RETURNED_GALLONS", "TRANS_RET_GAL_")

# Function to sum the values by period
sum_transactions <- function(df, value_column, prefix) {
  df %>%
    group_by(CUSTOMER_NUMBER, YEAR_MONTH) %>%
    summarise(value_sum = sum(!!sym(value_column), na.rm = TRUE), .groups = "drop") %>%
    pivot_wider(names_from = YEAR_MONTH, values_from = value_sum,
                names_prefix = prefix, values_fill = list(value_sum = 0))
}

# Summing quantities for each metric
qtd_ordered_cases <- sum_transactions(full_data, "ORDERED_CASES", "QTD_ORD_CA_")
qtd_ordered_gallons <- sum_transactions(full_data, "ORDERED_GALLONS", "QTD_ORD_GAL_")
qtd_delivered_cases <- sum_transactions(full_data, "DELIVERED_CASES", "QTD_DLV_CA_")
qtd_delivered_gallons <- sum_transactions(full_data, "DELIVERED_GALLONS", "QTD_DLV_GAL_")
qtd_returned_cases <- sum_transactions(full_data, "RETURNED_CASES", "QTD_RET_CA_")
qtd_returned_gallons <- sum_transactions(full_data, "RETURNED_GALLONS", "QTD_RET_GAL_")

# Ensure the columns in column_order are present in full_data
column_order <- c("CUSTOMER_NUMBER", "PRIMARY_GROUP_NUMBER", "FREQUENT_ORDER_TYPE", "FIRST_DELIVERY_DATE",
                  "ON_BOARDING_DATE", "LOCAL_FOUNT_ONLY", "COLD_DRINK_CHANNEL", "TRADE_CHANNEL",
                  "SUB_TRADE_CHANNEL", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "ZIP_CODE", "CHAIN_MEMBER",
                  "CITY", "STATE", "COUNTY", "REGION", "LATITUDE", "LONGITUDE")

# Check if all columns exist in full_data
missing_cols <- setdiff(column_order, colnames(full_data))
if (length(missing_cols) > 0) {
  stop("The following columns are missing in full_data: ", paste(missing_cols, collapse = ", "))
}

# Count the number of transactions per customer
trans_count <- full_data %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(TRANSACTIONS_DATE_COUNT = n(), .groups = "drop")

# Joining the data with the required columns in the desired order
full_data_customer <- distinct(full_data[, column_order]) %>%
  left_join(trans_count, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_ordered_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_ordered_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_delivered_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_delivered_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_returned_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(trans_returned_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_ordered_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_ordered_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_delivered_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_delivered_gallons, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_returned_cases, by = "CUSTOMER_NUMBER") %>%
  left_join(qtd_returned_gallons, by = "CUSTOMER_NUMBER")

# Rename Order Types
full_data <- full_data %>%
  mutate(ORDER_TYPE = dplyr::recode(ORDER_TYPE,
                                    "CALL CENTER" = "CALL.CENTER",
                                    "MYCOKE LEGACY" = "MYCOKE.LEGACY",
                                    "SALES REP" = "SALES.REP"))

# Count transactions by ORDER_TYPE
order_type_count <- full_data %>%
  group_by(CUSTOMER_NUMBER, ORDER_TYPE) %>%
  summarise(order_type_count = n(), .groups = "drop") %>%
  pivot_wider(names_from = ORDER_TYPE, values_from = order_type_count,
              names_prefix = "OT_", values_fill = list(order_type_count = 0))

# Count transactions by DLV_TYPE
dlv_type_count <- full_data %>%
  group_by(CUSTOMER_NUMBER, DLV_TYPE) %>%
  summarise(dlv_type_count = n(), .groups = "drop") %>%
  pivot_wider(names_from = DLV_TYPE, values_from = dlv_type_count,
              names_prefix = "DLVT_", values_fill = list(dlv_type_count = 0))

# Join with full_data_customer to add the ORDER_TYPE and DLV_TYPE columns
full_data_customer <- full_data_customer %>%
  left_join(order_type_count, by = "CUSTOMER_NUMBER") %>%
  left_join(dlv_type_count, by = "CUSTOMER_NUMBER")

# Adding the requested summary columns
full_data_customer <- full_data_customer %>%
  mutate(TOTAL_CASES_ORDERED = rowSums(full_data_customer[, grep("^QTD_ORD_CA_", names(full_data_customer))]),
         TOTAL_CASES_DELIVERED = rowSums(full_data_customer[, grep("^QTD_DLV_CA_", names(full_data_customer))]),
         TOTAL_GALLONS_ORDERED = rowSums(full_data_customer[, grep("^QTD_ORD_GAL_", names(full_data_customer))]),
         TOTAL_GALLONS_DELIVERED = rowSums(full_data_customer[, grep("^QTD_DLV_GAL_", names(full_data_customer))]),
         TOTAL_CASES_RETURNED = rowSums(full_data_customer[, grep("^QTD_RET_CA_", names(full_data_customer))]),
         TOTAL_GALLONS_RETURNED = rowSums(full_data_customer[, grep("^QTD_RET_GAL_", names(full_data_customer))]))

# Ensuring column order
ot_columns <- colnames(order_type_count)[-1]
dlvt_columns <- colnames(dlv_type_count)[-1]
summary_columns <- c("TOTAL_CASES_ORDERED", "TOTAL_CASES_DELIVERED", "TOTAL_GALLONS_ORDERED",
                     "TOTAL_GALLONS_DELIVERED", "TOTAL_CASES_RETURNED", "TOTAL_GALLONS_RETURNED")
transaction_columns <- grep("^TRANS_", colnames(full_data_customer), value = TRUE)
quantity_columns <- grep("^QTD_", colnames(full_data_customer), value = TRUE)
ordered_columns <- c(column_order, "TRANSACTIONS_DATE_COUNT", ot_columns, dlvt_columns, summary_columns,
                     sort(transaction_columns), sort(quantity_columns))

# Reordering full_data_customer
full_data_customer <- full_data_customer[, ordered_columns]

# Replacing NAs with 0 in transaction and quantity columns
full_data_customer[is.na(full_data_customer)] <- 0

# Extra variables
# Define reference date
ref_date <- as.Date("2025-02-01")

# 1. DAYS_FIRST_DLV
full_data_customer$DAYS_FIRST_DLV <- as.numeric(difftime(ref_date, full_data_customer$FIRST_DELIVERY_DATE, units = "days"))

# 2. DAYS_ONBOARDING
full_data_customer$DAYS_ONBOARDING <- as.numeric(difftime(ref_date, full_data_customer$ON_BOARDING_DATE, units = "days"))

# 3. Average transactions per month
# Replace NA with 0 for missing transactions
full_data_customer[is.na(full_data_customer)] <- 0

# Calculate the average transactions per month
cols_to_average_dlv <- grep("^TRANS_DLV_CA", names(full_data_customer), value = TRUE)
full_data_customer[cols_to_average_dlv] <- lapply(full_data_customer[cols_to_average_dlv], as.numeric)
full_data_customer$AVG_TRANS_DLV_CA_M <- rowMeans(full_data_customer[, cols_to_average_dlv], na.rm = TRUE)

cols_to_average_gal <- grep("^TRANS_DLV_GAL", names(full_data_customer), value = TRUE)
full_data_customer[cols_to_average_gal] <- lapply(full_data_customer[cols_to_average_gal], as.numeric)
full_data_customer$AVG_TRANS_DLV_GAL_M <- rowMeans(full_data_customer[, cols_to_average_gal], na.rm = TRUE)

cols_to_average_ord_ca <- grep("^TRANS_ORD_CA", names(full_data_customer), value = TRUE)
full_data_customer[cols_to_average_ord_ca] <- lapply(full_data_customer[cols_to_average_ord_ca], as.numeric)
full_data_customer$AVG_TRANS_ORD_CA_M <- rowMeans(full_data_customer[, cols_to_average_ord_ca], na.rm = TRUE)

cols_to_average_ord_gal <- grep("^TRANS_ORD_GAL", names(full_data_customer), value = TRUE)
full_data_customer[cols_to_average_ord_gal] <- lapply(full_data_customer[cols_to_average_ord_gal], as.numeric)
full_data_customer$AVG_TRANS_ORD_GAL_M <- rowMeans(full_data_customer[, cols_to_average_ord_gal], na.rm = TRUE)

cols_to_average_ret_ca <- grep("^TRANS_RET_CA", names(full_data_customer), value = TRUE)
full_data_customer[cols_to_average_ret_ca] <- lapply(full_data_customer[cols_to_average_ret_ca], as.numeric)
full_data_customer$AVG_TRANS_RET_CA_M <- rowMeans(full_data_customer[, cols_to_average_ret_ca], na.rm = TRUE)

cols_to_average_ret_gal <- grep("^TRANS_RET_GAL", names(full_data_customer), value = TRUE)
full_data_customer[cols_to_average_ret_gal] <- lapply(full_data_customer[cols_to_average_ret_gal], as.numeric)
full_data_customer$AVG_TRANS_RET_GAL_M <- rowMeans(full_data_customer[, cols_to_average_ret_gal], na.rm = TRUE)

# 4. Number of transactions per year (sum annual columns)
full_data_customer$NUM_TRANS_ORD_CA_23 <- rowSums(full_data_customer[, grep("^TRANS_ORD_CA_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_ORD_CA_24 <- rowSums(full_data_customer[, grep("^TRANS_ORD_CA_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_ORD_GAL_23 <- rowSums(full_data_customer[, grep("^TRANS_ORD_GAL_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_ORD_GAL_24 <- rowSums(full_data_customer[, grep("^TRANS_ORD_GAL_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_DLV_CA_23 <- rowSums(full_data_customer[, grep("^TRANS_DLV_CA_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_DLV_CA_24 <- rowSums(full_data_customer[, grep("^TRANS_DLV_CA_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_DLV_GAL_23 <- rowSums(full_data_customer[, grep("^TRANS_DLV_GAL_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_DLV_GAL_24 <- rowSums(full_data_customer[, grep("^TRANS_DLV_GAL_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_RET_CA_23 <- rowSums(full_data_customer[, grep("^TRANS_RET_CA_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_RET_CA_24 <- rowSums(full_data_customer[, grep("^TRANS_RET_CA_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_RET_GAL_23 <- rowSums(full_data_customer[, grep("^TRANS_RET_GAL_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$NUM_TRANS_RET_GAL_24 <- rowSums(full_data_customer[, grep("^TRANS_RET_GAL_2024", names(full_data_customer))], na.rm = TRUE)

# 5. Sum of quantities per year
full_data_customer$QTD_ORD_CA_2023 <- rowSums(full_data_customer[, grep("^QTD_ORD_CA_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_ORD_GAL_2023 <- rowSums(full_data_customer[, grep("^QTD_ORD_GAL_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_ORD_CA_2024 <- rowSums(full_data_customer[, grep("^QTD_ORD_CA_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_ORD_GAL_2024 <- rowSums(full_data_customer[, grep("^QTD_ORD_GAL_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_DLV_CA_2023 <- rowSums(full_data_customer[, grep("^QTD_DLV_CA_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_DLV_GAL_2023 <- rowSums(full_data_customer[, grep("^QTD_DLV_GAL_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_DLV_CA_2024 <- rowSums(full_data_customer[, grep("^QTD_DLV_CA_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_DLV_GAL_2024 <- rowSums(full_data_customer[, grep("^QTD_DLV_GAL_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_RET_CA_2023 <- rowSums(full_data_customer[, grep("^QTD_RET_CA_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_RET_GAL_2023 <- rowSums(full_data_customer[, grep("^QTD_RET_GAL_2023", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_RET_CA_2024 <- rowSums(full_data_customer[, grep("^QTD_RET_CA_2024", names(full_data_customer))], na.rm = TRUE)
full_data_customer$QTD_RET_GAL_2024 <- rowSums(full_data_customer[, grep("^QTD_RET_GAL_2024", names(full_data_customer))], na.rm = TRUE)

# 6. Create activity flags ACTIVE_23 and ACTIVE_24
full_data_customer$ACTIVE_23 <- ifelse((full_data_customer$QTD_DLV_CA_2023 + full_data_customer$QTD_DLV_GAL_2023) > 0, 1, 0)
full_data_customer$ACTIVE_24 <- ifelse((full_data_customer$QTD_DLV_CA_2024 + full_data_customer$QTD_DLV_GAL_2024) > 0, 1, 0)

# Display the first few rows of the combined dataset
#head(full_data_customer)
```

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# Remove temporary data frames
rm(list = c("trans_ordered_cases", "trans_ordered_gallons", "trans_delivered_cases", "trans_delivered_gallons",
            "trans_returned_cases", "trans_returned_gallons", "qtd_ordered_cases", "qtd_ordered_gallons",
            "qtd_delivered_cases", "qtd_delivered_gallons", "qtd_returned_cases", "qtd_returned_gallons",
            "trans_count", "order_type_count", "dlv_type_count"))

# Remove temporary functions
rm(list = c("count_transactions", "sum_transactions"))

# Remove column order vector
rm(column_order)

# Remove unnecessary intermediate variables
rm(list = c("cols_to_average_dlv", "cols_to_average_gal", "cols_to_average_ord_ca", "cols_to_average_ord_gal",
            "cols_to_average_ret_ca", "cols_to_average_ret_gal", "ref_date"))
```

### 3.4 Estimated Delivery Costs

The delivery costs reflect estimated volumes: they were provided as the median cost within volume ranges, by type of COLD_DRINK_CHANNEL.
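As a side note, the ten 150-unit volume bands used in the chunks below (0–149, 150–299, …, 1350+) can also be generated compactly with base R's `cut()` rather than an explicit `case_when()` per band. A minimal sketch, using hypothetical sample volumes (not data from this project):

```r
# Hypothetical delivered volumes for illustration only
volumes <- c(10, 150, 1349, 2000)

# Breaks mirror the 150-unit bands used below; the last band (1350+) is open-ended
range_level <- cut(volumes,
                   breaks = c(seq(0, 1350, by = 150), Inf),
                   labels = paste0("RANGE_", 1:10, "_CASES"),
                   right = FALSE)  # intervals are [low, high), matching ">= low & < high"
as.character(range_level)
# "RANGE_1_CASES" "RANGE_2_CASES" "RANGE_9_CASES" "RANGE_10_CASES"
```

The explicit `case_when()` version kept in the analysis is more verbose but makes each band visible at a glance; `cut()` would be the more maintainable choice if the band width or count ever changes.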
```{r}
# Load the delivery cost data from the Excel file
cost_data <- read_excel("delivery_cost_data.xlsx")

# Convert 'Cold Drink Channel' to a factor
cost_data <- cost_data %>%
  mutate(COLD_DRINK_CHANNEL = factor(`Cold Drink Channel`))

# Manually recode 'Applicable To' values
cost_data <- cost_data %>%
  mutate(`Applicable To` = ifelse(`Applicable To` == "Bottles and Cans", "CASES",
                                  ifelse(`Applicable To` == "Fountain", "GALLONS", `Applicable To`)))

# Create RANGE_LEVEL based on 'Vol Range' and make it a factor
cost_data$RANGE_LEVEL <- factor(case_when(
  cost_data$`Vol Range` == "0 - 149" & cost_data$`Applicable To` == "CASES" ~ "RANGE_1_CASES",
  cost_data$`Vol Range` == "150 - 299" & cost_data$`Applicable To` == "CASES" ~ "RANGE_2_CASES",
  cost_data$`Vol Range` == "300 - 449" & cost_data$`Applicable To` == "CASES" ~ "RANGE_3_CASES",
  cost_data$`Vol Range` == "450 - 599" & cost_data$`Applicable To` == "CASES" ~ "RANGE_4_CASES",
  cost_data$`Vol Range` == "600 - 749" & cost_data$`Applicable To` == "CASES" ~ "RANGE_5_CASES",
  cost_data$`Vol Range` == "750 - 899" & cost_data$`Applicable To` == "CASES" ~ "RANGE_6_CASES",
  cost_data$`Vol Range` == "900 - 1049" & cost_data$`Applicable To` == "CASES" ~ "RANGE_7_CASES",
  cost_data$`Vol Range` == "1050 - 1199" & cost_data$`Applicable To` == "CASES" ~ "RANGE_8_CASES",
  cost_data$`Vol Range` == "1200 - 1349" & cost_data$`Applicable To` == "CASES" ~ "RANGE_9_CASES",
  cost_data$`Vol Range` == "1350+" & cost_data$`Applicable To` == "CASES" ~ "RANGE_10_CASES",
  cost_data$`Vol Range` == "0 - 149" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_1_GALLONS",
  cost_data$`Vol Range` == "150 - 299" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_2_GALLONS",
  cost_data$`Vol Range` == "300 - 449" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_3_GALLONS",
  cost_data$`Vol Range` == "450 - 599" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_4_GALLONS",
  cost_data$`Vol Range` == "600 - 749" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_5_GALLONS",
  cost_data$`Vol Range` == "750 - 899" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_6_GALLONS",
  cost_data$`Vol Range` == "900 - 1049" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_7_GALLONS",
  cost_data$`Vol Range` == "1050 - 1199" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_8_GALLONS",
  cost_data$`Vol Range` == "1200 - 1349" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_9_GALLONS",
  cost_data$`Vol Range` == "1350+" & cost_data$`Applicable To` == "GALLONS" ~ "RANGE_10_GALLONS"))

# Keep only the desired columns: COLD_DRINK_CHANNEL, `Vol Range`, RANGE_LEVEL, `Median Delivery Cost`
cost_data <- cost_data[, c("COLD_DRINK_CHANNEL", "Vol Range", "RANGE_LEVEL", "Median Delivery Cost")]

# Check the result
#head(cost_data)
```

The necessary variables will now be created to calculate the delivery costs for cases and gallons, for 2023 and 2024, by customer.

```{r}
# Create cost range columns for each year based on quantities
full_data_customer <- full_data_customer %>%
  mutate(
    # For 2023, categorize based on quantity ranges for cases
    COST_RANGE_CA_23 = case_when(
      QTD_DLV_CA_2023 >= 0 & QTD_DLV_CA_2023 < 150 ~ "RANGE_1_CASES",
      QTD_DLV_CA_2023 >= 150 & QTD_DLV_CA_2023 < 300 ~ "RANGE_2_CASES",
      QTD_DLV_CA_2023 >= 300 & QTD_DLV_CA_2023 < 450 ~ "RANGE_3_CASES",
      QTD_DLV_CA_2023 >= 450 & QTD_DLV_CA_2023 < 600 ~ "RANGE_4_CASES",
      QTD_DLV_CA_2023 >= 600 & QTD_DLV_CA_2023 < 750 ~ "RANGE_5_CASES",
      QTD_DLV_CA_2023 >= 750 & QTD_DLV_CA_2023 < 900 ~ "RANGE_6_CASES",
      QTD_DLV_CA_2023 >= 900 & QTD_DLV_CA_2023 < 1050 ~ "RANGE_7_CASES",
      QTD_DLV_CA_2023 >= 1050 & QTD_DLV_CA_2023 < 1200 ~ "RANGE_8_CASES",
      QTD_DLV_CA_2023 >= 1200 & QTD_DLV_CA_2023 < 1350 ~ "RANGE_9_CASES",
      QTD_DLV_CA_2023 >= 1350 ~ "RANGE_10_CASES",
      TRUE ~ NA_character_),
    # For 2024, categorize based on quantity ranges for cases
    COST_RANGE_CA_24 = case_when(
      QTD_DLV_CA_2024 >= 0 & QTD_DLV_CA_2024 < 150 ~ "RANGE_1_CASES",
      QTD_DLV_CA_2024 >= 150 & QTD_DLV_CA_2024 < 300 ~ "RANGE_2_CASES",
      QTD_DLV_CA_2024 >= 300 & QTD_DLV_CA_2024 < 450 ~ "RANGE_3_CASES",
      QTD_DLV_CA_2024 >= 450 & QTD_DLV_CA_2024 < 600 ~ "RANGE_4_CASES",
      QTD_DLV_CA_2024 >= 600 & QTD_DLV_CA_2024 < 750 ~ "RANGE_5_CASES",
      QTD_DLV_CA_2024 >= 750 & QTD_DLV_CA_2024 < 900 ~ "RANGE_6_CASES",
QTD_DLV_CA_2024 >= 900 & QTD_DLV_CA_2024 < 1050 ~ "RANGE_7_CASES",
    QTD_DLV_CA_2024 >= 1050 & QTD_DLV_CA_2024 < 1200 ~ "RANGE_8_CASES",
    QTD_DLV_CA_2024 >= 1200 & QTD_DLV_CA_2024 < 1350 ~ "RANGE_9_CASES",
    QTD_DLV_CA_2024 >= 1350 ~ "RANGE_10_CASES",
    TRUE ~ NA_character_),
  # For 2023, categorize based on quantity ranges for gallons
  COST_RANGE_GAL_23 = case_when(
    QTD_DLV_GAL_2023 >= 0 & QTD_DLV_GAL_2023 < 150 ~ "RANGE_1_GALLONS",
    QTD_DLV_GAL_2023 >= 150 & QTD_DLV_GAL_2023 < 300 ~ "RANGE_2_GALLONS",
    QTD_DLV_GAL_2023 >= 300 & QTD_DLV_GAL_2023 < 450 ~ "RANGE_3_GALLONS",
    QTD_DLV_GAL_2023 >= 450 & QTD_DLV_GAL_2023 < 600 ~ "RANGE_4_GALLONS",
    QTD_DLV_GAL_2023 >= 600 & QTD_DLV_GAL_2023 < 750 ~ "RANGE_5_GALLONS",
    QTD_DLV_GAL_2023 >= 750 & QTD_DLV_GAL_2023 < 900 ~ "RANGE_6_GALLONS",
    QTD_DLV_GAL_2023 >= 900 & QTD_DLV_GAL_2023 < 1050 ~ "RANGE_7_GALLONS",
    QTD_DLV_GAL_2023 >= 1050 & QTD_DLV_GAL_2023 < 1200 ~ "RANGE_8_GALLONS",
    QTD_DLV_GAL_2023 >= 1200 & QTD_DLV_GAL_2023 < 1350 ~ "RANGE_9_GALLONS",
    QTD_DLV_GAL_2023 >= 1350 ~ "RANGE_10_GALLONS",
    TRUE ~ NA_character_),
  # For 2024, categorize based on quantity ranges for gallons
  COST_RANGE_GAL_24 = case_when(
    QTD_DLV_GAL_2024 >= 0 & QTD_DLV_GAL_2024 < 150 ~ "RANGE_1_GALLONS",
    QTD_DLV_GAL_2024 >= 150 & QTD_DLV_GAL_2024 < 300 ~ "RANGE_2_GALLONS",
    QTD_DLV_GAL_2024 >= 300 & QTD_DLV_GAL_2024 < 450 ~ "RANGE_3_GALLONS",
    QTD_DLV_GAL_2024 >= 450 & QTD_DLV_GAL_2024 < 600 ~ "RANGE_4_GALLONS",
    QTD_DLV_GAL_2024 >= 600 & QTD_DLV_GAL_2024 < 750 ~ "RANGE_5_GALLONS",
    QTD_DLV_GAL_2024 >= 750 & QTD_DLV_GAL_2024 < 900 ~ "RANGE_6_GALLONS",
    QTD_DLV_GAL_2024 >= 900 & QTD_DLV_GAL_2024 < 1050 ~ "RANGE_7_GALLONS",
    QTD_DLV_GAL_2024 >= 1050 & QTD_DLV_GAL_2024 < 1200 ~ "RANGE_8_GALLONS",
    QTD_DLV_GAL_2024 >= 1200 & QTD_DLV_GAL_2024 < 1350 ~ "RANGE_9_GALLONS",
    QTD_DLV_GAL_2024 >= 1350 ~ "RANGE_10_GALLONS",
    TRUE ~ NA_character_))

# First join for UNIT_COST_CA_23
full_data_customer <- full_data_customer %>%
  left_join(cost_data %>% dplyr::select(COLD_DRINK_CHANNEL, RANGE_LEVEL, `Median Delivery Cost`),
            by = c("COLD_DRINK_CHANNEL" = "COLD_DRINK_CHANNEL", "COST_RANGE_CA_23" = "RANGE_LEVEL")) %>%
  mutate(UNIT_COST_CA_23 = `Median Delivery Cost`) %>%
  dplyr::select(-`Median Delivery Cost`)  # Remove unwanted column

# Second join for UNIT_COST_CA_24
full_data_customer <- full_data_customer %>%
  left_join(cost_data %>% dplyr::select(COLD_DRINK_CHANNEL, RANGE_LEVEL, `Median Delivery Cost`),
            by = c("COLD_DRINK_CHANNEL" = "COLD_DRINK_CHANNEL", "COST_RANGE_CA_24" = "RANGE_LEVEL")) %>%
  mutate(UNIT_COST_CA_24 = `Median Delivery Cost`) %>%
  dplyr::select(-`Median Delivery Cost`)  # Remove unwanted column

# Third join for UNIT_COST_GAL_23
full_data_customer <- full_data_customer %>%
  left_join(cost_data %>% dplyr::select(COLD_DRINK_CHANNEL, RANGE_LEVEL, `Median Delivery Cost`),
            by = c("COLD_DRINK_CHANNEL" = "COLD_DRINK_CHANNEL", "COST_RANGE_GAL_23" = "RANGE_LEVEL")) %>%
  mutate(UNIT_COST_GAL_23 = `Median Delivery Cost`) %>%
  dplyr::select(-`Median Delivery Cost`)  # Remove unwanted column

# Fourth join for UNIT_COST_GAL_24
full_data_customer <- full_data_customer %>%
  left_join(cost_data %>% dplyr::select(COLD_DRINK_CHANNEL, RANGE_LEVEL, `Median Delivery Cost`),
            by = c("COLD_DRINK_CHANNEL" = "COLD_DRINK_CHANNEL", "COST_RANGE_GAL_24" = "RANGE_LEVEL")) %>%
  mutate(UNIT_COST_GAL_24 = `Median Delivery Cost`) %>%
  dplyr::select(-`Median Delivery Cost`)  # Remove unwanted column

# Calculating delivery costs for each year and drink type
full_data_customer <- full_data_customer %>%
  mutate(COST_CA_23 = QTD_DLV_CA_2023 * UNIT_COST_CA_23,
         COST_CA_24 = QTD_DLV_CA_2024 * UNIT_COST_CA_24,
         COST_GAL_23 = QTD_DLV_GAL_2023 * UNIT_COST_GAL_23,
         COST_GAL_24 = QTD_DLV_GAL_2024 * UNIT_COST_GAL_24)

# Round unit costs and costs to eight decimal places
full_data_customer <- full_data_customer %>%
  mutate(UNIT_COST_CA_23 = round(UNIT_COST_CA_23, 8),
         UNIT_COST_CA_24 = round(UNIT_COST_CA_24, 8),
         UNIT_COST_GAL_23 = round(UNIT_COST_GAL_23, 8),
         UNIT_COST_GAL_24 = round(UNIT_COST_GAL_24, 8),
         COST_CA_23 = round(COST_CA_23, 8),
         COST_CA_24 = round(COST_CA_24, 8),
         COST_GAL_23 = round(COST_GAL_23, 8),
         COST_GAL_24 = round(COST_GAL_24, 8))
```

The table below presents the information that constitutes the calculation of the delivery cost per customer.

```{r, warning=FALSE}
# Costs table
summary_table <- as.data.table(full_data_customer)[, .(
  CUSTOMER_NUMBER, COLD_DRINK_CHANNEL,
  QTD_DLV_CA_2023 = round(QTD_DLV_CA_2023, 0), QTD_DLV_CA_2024 = round(QTD_DLV_CA_2024, 0),
  QTD_DLV_GAL_2023 = round(QTD_DLV_GAL_2023, 0), QTD_DLV_GAL_2024 = round(QTD_DLV_GAL_2024, 0),
  COST_RANGE_CA_23, COST_RANGE_CA_24, COST_RANGE_GAL_23, COST_RANGE_GAL_24,
  UNIT_COST_CA_23 = round(UNIT_COST_CA_23, 2), UNIT_COST_CA_24 = round(UNIT_COST_CA_24, 2),
  UNIT_COST_GAL_23 = round(UNIT_COST_GAL_23, 2), UNIT_COST_GAL_24 = round(UNIT_COST_GAL_24, 2),
  COST_CA_23 = round(COST_CA_23, 2), COST_CA_24 = round(COST_CA_24, 2),
  COST_GAL_23 = round(COST_GAL_23, 2), COST_GAL_24 = round(COST_GAL_24, 2))]

# Display the table interactively
datatable(summary_table, options = list(pageLength = 5, scrollX = TRUE, scrollY = TRUE))
```

All costs are calculated correctly. At this point, percentage variations for the number of operations, demand, and costs have not been generated, because not all customers have a history for both 2023 and 2024, which prevents such calculations.
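As a hedged sketch of how such variation could later be quantified for the subset of customers active in both years (illustrative only and not part of the pipeline; the `PCT_VAR_QTD` name is hypothetical, while the quantity columns are the ones defined above):

```{r, eval=FALSE}
# Illustrative sketch: year-over-year percentage variation in delivered
# quantity, computed only where both years have positive volume.
growth_sketch <- full_data_customer %>%
  mutate(QTD_23 = QTD_DLV_CA_2023 + QTD_DLV_GAL_2023,
         QTD_24 = QTD_DLV_CA_2024 + QTD_DLV_GAL_2024) %>%
  filter(QTD_23 > 0, QTD_24 > 0) %>%
  mutate(PCT_VAR_QTD = round((QTD_24 - QTD_23) / QTD_23 * 100, 1))
```

The same pattern would apply to transaction counts and cost columns; customers lacking a 2023 baseline simply drop out of the calculation rather than producing undefined ratios.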
However, methods to quantify the growth of each customer will be identified later.

### 3.5 Target Variables: Initial Assumptions

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# Adding new columns to full_data and full_data_customer
# QTD delivered
full_data_customer$QTD_DLV_CA_GAL_2023 <- full_data_customer$QTD_DLV_CA_2023 + full_data_customer$QTD_DLV_GAL_2023
full_data_customer$QTD_DLV_CA_GAL_2024 <- full_data_customer$QTD_DLV_CA_2024 + full_data_customer$QTD_DLV_GAL_2024

# Creating the QTD_DLV_TOTAL variable
full_data_customer$QTD_DLV_TOTAL <- full_data_customer$QTD_DLV_CA_GAL_2023 + full_data_customer$QTD_DLV_CA_GAL_2024

# Create the TOTAL_COST_CA_GAL column
full_data_customer$TOTAL_COST_CA_GAL <- full_data_customer$COST_CA_23 + full_data_customer$COST_CA_24 +
  full_data_customer$COST_GAL_23 + full_data_customer$COST_GAL_24

# Adding cost-related columns from full_data_customer to full_data based on CUSTOMER_NUMBER
full_data$UNIT_COST_CA_23 <- as.numeric(full_data_customer$UNIT_COST_CA_23[match(full_data$CUSTOMER_NUMBER, full_data_customer$CUSTOMER_NUMBER)])
full_data$UNIT_COST_GAL_23 <- as.numeric(full_data_customer$UNIT_COST_GAL_23[match(full_data$CUSTOMER_NUMBER, full_data_customer$CUSTOMER_NUMBER)])
full_data$UNIT_COST_CA_24 <- as.numeric(full_data_customer$UNIT_COST_CA_24[match(full_data$CUSTOMER_NUMBER, full_data_customer$CUSTOMER_NUMBER)])
full_data$UNIT_COST_GAL_24 <- as.numeric(full_data_customer$UNIT_COST_GAL_24[match(full_data$CUSTOMER_NUMBER, full_data_customer$CUSTOMER_NUMBER)])

# Creating the DLV_COST_CA column
full_data$DLV_COST_CA <- ifelse(full_data$YEAR == 2023, full_data$DELIVERED_CASES * full_data$UNIT_COST_CA_23,
                         ifelse(full_data$YEAR == 2024, full_data$DELIVERED_CASES * full_data$UNIT_COST_CA_24, NA))

# Creating the DLV_COST_GAL column
full_data$DLV_COST_GAL <- ifelse(full_data$YEAR == 2023, full_data$DELIVERED_GALLONS * full_data$UNIT_COST_GAL_23,
                          ifelse(full_data$YEAR == 2024, full_data$DELIVERED_GALLONS * full_data$UNIT_COST_GAL_24, NA))

# Creating the DLV_COST_TOTAL column
full_data$DLV_COST_TOTAL <- full_data$DLV_COST_CA + full_data$DLV_COST_GAL
```

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# Generating the summary table for TOTAL_COST_CA_GAL in full_data_customer
full_data_customer %>%
  summarise(
    `Sum full_data_customer TOTAL_COST_CA_GAL` = formatC(sum(TOTAL_COST_CA_GAL, na.rm = TRUE), format = "f", digits = 1, big.mark = ","),
    Max = formatC(max(TOTAL_COST_CA_GAL, na.rm = TRUE), format = "f", digits = 1, big.mark = ","),
    Mean = formatC(mean(TOTAL_COST_CA_GAL, na.rm = TRUE), format = "f", digits = 1, big.mark = ","),
    Median = formatC(median(TOTAL_COST_CA_GAL, na.rm = TRUE), format = "f", digits = 1, big.mark = ","))
```

As initially explained, we will establish classifications related to the target variables to create an initial reference point.

#### 3.5.1 - Demand Threshold and Fleet Assignment

The average annual consumption per customer will be calculated, and customers will be classified based on whether they exceed the threshold of 400 units (cases plus gallons).

```{r}
# Calculating the average
full_data_customer$AVG_ANNUAL_CONSUMP <- round((full_data_customer$QTD_DLV_CA_GAL_2023 + full_data_customer$QTD_DLV_CA_GAL_2024) / 2, 1)

# Creating the THRESHOLD_REACH variable
full_data_customer$THRESHOLD_REACH <- ifelse(full_data_customer$AVG_ANNUAL_CONSUMP < 400, 0, 1)

# Summarize data by THRESHOLD_REACH
data_threshold_reach <- full_data_customer %>%
  group_by(THRESHOLD_REACH) %>%
  summarise(CustomerCount = n(), .groups = 'drop') %>%
  mutate(Percentage = round(CustomerCount / sum(CustomerCount) * 100, 1))  # Calculate percentage

# Display the table
kable(data_threshold_reach, col.names = c("Threshold Reach", "Customer Count", "Percentage (%)"), format = "simple")
```

In total, 23,081 customers (76%) did not reach the threshold of 400 units (cases plus gallons) on average per year, while the remaining 7,239 did.

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# Summarize data by THRESHOLD_REACH
data_threshold_reach <- full_data_customer %>%
  group_by(THRESHOLD_REACH) %>%
  summarise(CustomerCount = n(), .groups = 'drop') %>%
  mutate(Percentage = round(CustomerCount / sum(CustomerCount) * 100, 1))  # Calculate percentage

# Create bar plot for THRESHOLD_REACH with both numbers and percentages in the labels
ggplot(data_threshold_reach, aes(x = factor(THRESHOLD_REACH), y = CustomerCount, fill = factor(THRESHOLD_REACH))) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.7) +
  geom_text(aes(label = paste(CustomerCount, "(", Percentage, "%)", sep = "")),
            position = position_stack(vjust = 0.5), color = "black", size = 3.2) +
  labs(title = "Percentage of Customers by Threshold Reach",
       x = "Threshold Reach",
       y = "Customer Count") +
  scale_fill_manual(values = c("0" = "lightblue", "1" = "lightgreen")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        axis.title = element_text(size = 10),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank()) +
  guides(fill = guide_legend(title = "Threshold Reach"))

# Summarize data by THRESHOLD_REACH, summing TOTAL_COST_CA_GAL
data_threshold_cost <- full_data_customer %>%
  group_by(THRESHOLD_REACH) %>%
  summarise(TotalCostSum = sum(TOTAL_COST_CA_GAL, na.rm = TRUE), .groups = 'drop') %>%
  mutate(Percentage = round(TotalCostSum / sum(TotalCostSum) * 100, 1),  # Calculate percentage
         TotalCostSumMillions = TotalCostSum / 1e6)  # Convert to millions

# Create bar plot for TOTAL_COST_CA_GAL showing percentages in the labels, and totals in millions above the bars
ggplot(data_threshold_cost, aes(x = factor(THRESHOLD_REACH), y = Percentage, fill = factor(THRESHOLD_REACH))) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.7) +
  geom_text(aes(label = paste(Percentage, "%", sep = "")),  # Show percentage only
            position = position_stack(vjust = 0.5), color = "black", size = 3.2) +
  geom_text(aes(label = paste(scales::comma(TotalCostSumMillions, accuracy = 0.1), "M", sep = "")),  # Show total in millions above bars
            position = position_nudge(y = 2),  # Adjust the position to be above the bar
            color = "black", size = 3.5, fontface = "bold") +
  labs(title = "Percentage of Total Cost by Threshold Reach",
       x = "Threshold Reach",
       y = "Percentage") +
  scale_fill_manual(values = c("0" = "lightcoral", "1" = "lightseagreen")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.x = element_text(size = 10),
        axis.text.y = element_text(size = 10),
        axis.title = element_text(size = 10),
        panel.grid.major = element_blank(),  # Remove both vertical and horizontal grid lines
        panel.grid.minor = element_blank()) +  # Remove minor grid lines
  guides(fill = guide_legend(title = "Threshold Reach"))
```

Customers who exceed 400 units annually will be assigned to Red Trucks, while the remaining customers will be allocated to White Trucks.

```{r}
# Create the FLEET_TYPE column based on THRESHOLD_REACH only
full_data_customer$FLEET_TYPE <- ifelse(full_data_customer$THRESHOLD_REACH == 1, "RED TRUCK", "WHITE TRUCK")

# Group and calculate the number of customers by FLEET_TYPE and LOCAL_FOUNT_ONLY
summary_fleet_type <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(total_customers = n(), .groups = "drop")

# Calculate the percentage of customers within each LOCAL_FOUNT_ONLY group separately
summary_fleet_type <- summary_fleet_type %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(pct_customers = total_customers / sum(total_customers) * 100)  # Percentage within each LOCAL_FOUNT_ONLY group

# Transform data into long format for percentages
summary_fleet_type_long <- summary_fleet_type %>%
  pivot_longer(cols = starts_with("pct_"), names_to = "metric", values_to = "percentage") %>%
  mutate(metric = factor(metric, levels = c("pct_customers"), labels = c("Percentage of Customers")))

# Ensure LOCAL_FOUNT_ONLY and FLEET_TYPE are factors
summary_fleet_type_long$LOCAL_FOUNT_ONLY <- factor(summary_fleet_type_long$LOCAL_FOUNT_ONLY, levels = c("0", "1"))
summary_fleet_type_long$FLEET_TYPE <- factor(summary_fleet_type_long$FLEET_TYPE, levels = c("RED TRUCK", "WHITE TRUCK"))

# Plot for percentages with FLEET_TYPE as colors and LOCAL_FOUNT_ONLY as groups
ggplot(summary_fleet_type_long, aes(x = LOCAL_FOUNT_ONLY, y = percentage, fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(percentage, suffix = "%")),
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  labs(title = "Percentage of Customers by Fleet Type and Local Fountain Only") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +  # Set colors for RED and WHITE TRUCK
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.title = element_blank(),  # Remove legend title
        legend.position = "right",  # Position legend on the right side
        legend.box = "vertical",  # Ensure vertical arrangement for the legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.text = element_text(size = 10, face = "bold"),
        strip.background = element_blank(),
        axis.text.x = element_text(size = 10),
        panel.spacing = unit(1, "lines"),
        strip.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  guides(fill = guide_legend(title = "Fleet Type"))  # Add a legend title

# Group and calculate the number of customers by FLEET_TYPE and LOCAL_FOUNT_ONLY
summary_fleet_type_count <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(total_customers = n(), .groups = "drop")

# Display the summary with the count of customers by fleet type and LOCAL_FOUNT_ONLY
#summary_fleet_type_count
```

According to these criteria, 13% of Local Fountain Only customers would be assigned to RED TRUCK.
Among the other customers, 24% would receive deliveries via RED TRUCK.

```{r}
# Group by LOCAL_FOUNT_ONLY and FLEET_TYPE, then calculate the total delivered volume (QTD_DLV_TOTAL)
summary_fleet_type_total <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(total_QTD_DLV = sum(QTD_DLV_TOTAL, na.rm = TRUE),  # Sum of QTD_DLV_TOTAL for each FLEET_TYPE and LOCAL_FOUNT_ONLY
            .groups = "drop")

# Ensure LOCAL_FOUNT_ONLY and FLEET_TYPE are factors for better plotting
summary_fleet_type_total$LOCAL_FOUNT_ONLY <- factor(summary_fleet_type_total$LOCAL_FOUNT_ONLY, levels = c("0", "1"))
summary_fleet_type_total$FLEET_TYPE <- factor(summary_fleet_type_total$FLEET_TYPE, levels = c("RED TRUCK", "WHITE TRUCK"))

# Plot the total delivered volume (QTD_DLV_TOTAL) by FLEET_TYPE and LOCAL_FOUNT_ONLY
ggplot(summary_fleet_type_total, aes(x = LOCAL_FOUNT_ONLY, y = total_QTD_DLV, fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(total_QTD_DLV)),
            position = position_dodge(width = 0.8), vjust = 0.0, size = 3.5) +
  labs(title = "Total Delivered Volume by Fleet Type and Local Fountain Only (23 & 24)") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +  # Set colors for RED and WHITE TRUCK
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.title = element_blank(),  # Remove legend title
        legend.position = "right",  # Position legend on the right side
        legend.box = "vertical",  # Ensure vertical arrangement for the legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.text = element_text(size = 10, face = "bold"),
        strip.background = element_blank(),
        axis.text.x = element_text(size = 10),
        panel.spacing = unit(1, "lines"),
        strip.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  guides(fill = guide_legend(title = "Fleet Type"))  # Add a legend title

# Group and calculate the total delivered volume (QTD_DLV_TOTAL) by FLEET_TYPE and LOCAL_FOUNT_ONLY
summary_fleet_type_count <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(total_QTD_DLV = sum(QTD_DLV_TOTAL, na.rm = TRUE), .groups = "drop")

# Display the summary with the total delivered volume by FLEET_TYPE and LOCAL_FOUNT_ONLY
#summary_fleet_type_count
```

The vast majority of the volume would be delivered by RED TRUCK (85% of the total), with the remaining portion delivered by WHITE TRUCK (15%).

```{r}
# Group by LOCAL_FOUNT_ONLY and FLEET_TYPE and calculate total delivered volume (QTD_DLV_TOTAL)
summary_fleet_type_pct <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(total_QTD_DLV = sum(QTD_DLV_TOTAL, na.rm = TRUE),  # Sum of QTD_DLV_TOTAL for each FLEET_TYPE and LOCAL_FOUNT_ONLY
            .groups = "drop") %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(pct_QTD_DLV = total_QTD_DLV / sum(total_QTD_DLV) * 100)  # Percentage of delivered volume per LOCAL_FOUNT_ONLY group

# Ensure LOCAL_FOUNT_ONLY and FLEET_TYPE are factors for better plotting
summary_fleet_type_pct$LOCAL_FOUNT_ONLY <- factor(summary_fleet_type_pct$LOCAL_FOUNT_ONLY, levels = c("0", "1"))
summary_fleet_type_pct$FLEET_TYPE <- factor(summary_fleet_type_pct$FLEET_TYPE, levels = c("RED TRUCK", "WHITE TRUCK"))

# Plot the percentage of delivered volume (QTD_DLV_TOTAL) by FLEET_TYPE and LOCAL_FOUNT_ONLY
ggplot(summary_fleet_type_pct, aes(x = LOCAL_FOUNT_ONLY, y = pct_QTD_DLV, fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = paste0(round(pct_QTD_DLV, 1), "%")),
            position = position_dodge(width = 0.8), vjust = 0.0, size = 3.5) +
  labs(title = "Percentage of Delivered Volume by Fleet Type and Local Fountain Only (23 & 24)") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +  # Set colors for RED and WHITE TRUCK
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.title = element_blank(),  # Remove legend title
        legend.position = "right",  # Position legend on the right side
        legend.box = "vertical",  # Ensure vertical arrangement for the legend
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.text = element_text(size = 10, face = "bold"),
        strip.background = element_blank(),
        axis.text.x = element_text(size = 10),
        panel.spacing = unit(1, "lines"),
        strip.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  guides(fill = guide_legend(title = "Fleet Type"))  # Add a legend title

# Group and calculate the percentage of delivered volume (QTD_DLV_TOTAL) by FLEET_TYPE and LOCAL_FOUNT_ONLY
summary_fleet_type_count_pct <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, FLEET_TYPE) %>%
  summarise(total_QTD_DLV = sum(QTD_DLV_TOTAL, na.rm = TRUE), .groups = "drop") %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(pct_QTD_DLV = total_QTD_DLV / sum(total_QTD_DLV) * 100)  # Percentage of delivered volume within each group

# Display the summary with the percentage of delivered volume by FLEET_TYPE and LOCAL_FOUNT_ONLY
#summary_fleet_type_count_pct
```

Considering the customer groups independently, nearly 59% of the volume delivered to local partners purchasing fountain only would be transported by RED TRUCKS, while for the remaining customers, almost 85% of the volume would be delivered by RED TRUCKS.

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# Convert FLEET_TYPE to factor
full_data_customer$FLEET_TYPE <- as.factor(full_data_customer$FLEET_TYPE)

# Add FLEET_TYPE to full_data (the transactions dataset)
full_data$FLEET_TYPE <- full_data_customer$FLEET_TYPE[match(full_data$CUSTOMER_NUMBER, full_data_customer$CUSTOMER_NUMBER)]

# List all variables in the environment
all_vars <- ls()

# Exclude 'full_data', 'full_data_customer', and the other core objects from removal
vars_to_keep <- c("full_data", "full_data_customer", "cost_data", "customer_address", "mydir", "one_seed",
                  "op_data", "profile_data", "reference_date", "custom_palette")

# Get the variables to remove
vars_to_remove <- setdiff(all_vars, vars_to_keep)

# Remove the temporary data frames
rm(list = vars_to_remove)

# Clean up by removing 'all_vars' and 'vars_to_remove'
rm(all_vars, vars_to_remove)
```

### 3.6 - Questions and Considerations on Missing Data and Unknown Classes

After the first portion of the EDA, there is a better understanding of the data, but not all questions have been answered. These will continue to be explored in the next section, though some may remain unresolved due to the nature of the questions. The following questions have been identified:

- Based on the available data, what would be a robust statistical approach to calculating the customer growth rate? A simplistic approach was initially used, relying on the average as a reference to visualize the data, but a more rigorous method could certainly be applied.
- What is the average load capacity of a Red Truck compared to a White Truck?
- Adding an ID for individual account executives to the customer profile data could be valuable. Is the quality of the account executive a confounding variable when looking at high-growth customers?
- Does the company set a delivery deadline in days or hours?

## 4. Exploratory Data Analysis (EDA) - Part II

After completing the initial analysis and building the datasets, focusing on the set objectives, we will explore more detailed information about the customers.

### 4.1 Customers overview

**Geographical Distribution of Customers**

Although the location data is not real, its distribution can be observed below.

```{r}
# Load the U.S. map
us_map <- map_data("state")

# Create the plot
ggplot() +
  geom_polygon(data = us_map, aes(x = long, y = lat, group = group),
               fill = "lightblue", color = "white") +
  geom_point(data = full_data_customer, aes(x = LONGITUDE, y = LATITUDE),
             color = "#B33951", alpha = 0.6, size = 0.5) +
  coord_fixed(1.3) +
  theme_minimal() +
  labs(title = "Customers Geographical Distribution") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
```

```{r, results='hide'}
# Calculate the number of unique customers
unique_customers <- length(unique(full_data_customer$CUSTOMER_NUMBER))

# Display the frequency of each value for the 'CHAIN_MEMBER' column
chain_member_count <- table(full_data_customer$CHAIN_MEMBER)

# Calculate the number of unique primary group numbers
unique_primary_groups <- length(unique(full_data_customer$PRIMARY_GROUP_NUMBER))

# Sum the costs for cases and gallons in 2023 and 2024
cost_dlv <- sum(full_data_customer$COST_CA_23, full_data_customer$COST_CA_24,
                full_data_customer$COST_GAL_23, full_data_customer$COST_GAL_24, na.rm = TRUE)

# Sum the number of delivery transactions for cases and gallons in 2023 and 2024
trans_dlv <- sum(full_data_customer$NUM_TRANS_DLV_CA_23, full_data_customer$NUM_TRANS_DLV_CA_24,
                 full_data_customer$NUM_TRANS_DLV_GAL_23, full_data_customer$NUM_TRANS_DLV_GAL_24, na.rm = TRUE)

# Sum the quantity of cases and gallons delivered in 2023 and 2024
qtd_dlv <- sum(full_data_customer$QTD_DLV_CA_2023, full_data_customer$QTD_DLV_GAL_2023,
               full_data_customer$QTD_DLV_CA_2024, full_data_customer$QTD_DLV_GAL_2024, na.rm = TRUE)

# Average cost per delivery transaction
avg_cost_per_transaction <- cost_dlv / trans_dlv

# Average cost per case or gallon delivered
avg_cost_per_quantity <- cost_dlv / qtd_dlv

# Display results
unique_customers          # Number of unique customers
chain_member_count        # Frequency count of each chain member
unique_primary_groups     # Number of unique primary group numbers
cost_dlv                  # Total cost for cases and gallons in 2023 and 2024
trans_dlv                 # Total number of transactions for cases and gallons in 2023 and 2024
avg_cost_per_transaction  # Average cost per delivery transaction
avg_cost_per_quantity     # Average cost per case or gallon delivered
```

After removing customers who did not make any transactions in 2023 and 2024, there are **30,320 unique customers** who made transactions during these years. Of these, **18,061 are unique outlets**, while **12,259 belong to 1,020 different chains** that have transacted with the company.

All of their delivery transactions represented a total cost of approximately **$67,907,394**, with an average of **$55.8 per delivery transaction** and **$1.88 per case or gallon delivered**.

### 4.2 Local Market Partners (Fountain Only)

```{r}
# Clean
clean_data <- full_data %>%
  filter(!is.na(LOCAL_FOUNT_ONLY)) %>%  # Keep rows where LOCAL_FOUNT_ONLY is not NA
  mutate(LOCAL_FOUNT_ONLY = factor(LOCAL_FOUNT_ONLY, levels = c("0", "1")))  # Convert to factor

# Aggregate data by LOCAL_FOUNT_ONLY and create the plot
summary_data <- clean_data %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(customers = n_distinct(CUSTOMER_NUMBER),
            transactions = n(),
            qtd_cas = sum(DELIVERED_CASES, na.rm = TRUE),
            qtd_gal = sum(DELIVERED_GALLONS, na.rm = TRUE),
            total_qtd = qtd_cas + qtd_gal,
            .groups = "drop") %>%
  mutate(pct_cust = customers / sum(customers) * 100,
         pct_trans = transactions / sum(transactions) * 100,
         pct_qtd = total_qtd / sum(total_qtd) * 100,
         pct_gal = qtd_gal / sum(qtd_gal) * 100) %>%
  rename(LFO = LOCAL_FOUNT_ONLY)

# Transform data to long format
summary_data_long <- summary_data %>%
  pivot_longer(cols = starts_with("pct_"), names_to = "metric", values_to = "percentage") %>%
  mutate(metric = factor(metric,
                         levels = c("pct_cust", "pct_trans", "pct_gal", "pct_qtd"),
                         labels = c("Customers", "Delivery Transactions", "Gallons", "Total (Cases+Gallons)")))

# Convert LFO to factor
summary_data_long$LFO <- factor(summary_data_long$LFO, levels = c("0", "1"))

# Create the plot
ggplot(summary_data_long, aes(x = LFO, y = percentage, fill = LFO)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = paste0(round(percentage, 1), "%")),
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  facet_wrap(~ metric, scales = "fixed", ncol = 2) +
  labs(title = "Percentage Breakdown by Consumption Pattern") +
  scale_fill_manual(values = c("0" = "#8ED081", "1" = "#A7ADC6")) +
  scale_y_continuous(labels = percent_format(scale = 1)) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank(),
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        strip.text = element_text(size = 10, face = "bold"),
        strip.background = element_blank(),
        axis.text.x = element_text(size = 10),
        panel.spacing = unit(1, "lines"),
        strip.text.y = element_blank(),
        axis.ticks.y = element_blank()) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only"))
```

Local market partners who purchase only fountain drinks (gallons) account for 4.5% of the customers and represent 6% of the company's gallon demand.
Their delivery transaction volume is low, contributing only 3%, and the volume delivered accounts for just 1.6% of the total negotiated volume.```{r}# Group and calculate sums and percentagessummary_full_data <- full_data_customer %>%group_by(LOCAL_FOUNT_ONLY) %>%summarise(total_cost_gal =sum(COST_GAL_23, na.rm =TRUE) +sum(COST_GAL_24, na.rm =TRUE),total_cost_ca =sum(COST_CA_23, na.rm =TRUE) +sum(COST_CA_24, na.rm =TRUE),total_cost_all = total_cost_gal + total_cost_ca,.groups ="drop" ) %>%mutate(pct_cost_gal = total_cost_gal / total_cost_all *100,pct_cost_ca = total_cost_ca / total_cost_all *100,pct_total = total_cost_all /sum(total_cost_all) *100 ) %>%rename(LFO = LOCAL_FOUNT_ONLY)# Transform data into long format for totalssummary_full_data_long <- summary_full_data %>%pivot_longer(cols =starts_with("total_"), names_to ="metric", values_to ="value" ) %>%mutate(metric =factor(metric, levels =c("total_cost_gal", "total_cost_ca", "total_cost_all"),labels =c("Cost Gallons (23 & 24)", "Cost Cases (23 & 24)", "Total Cost")) )# For percentagessummary_full_data_pct_long <- summary_full_data %>%pivot_longer(cols =starts_with("pct_"), names_to ="metric", values_to ="percentage" ) %>%mutate(metric =factor(metric, levels =c("pct_cost_gal", "pct_cost_ca", "pct_total"),labels =c("Percentage Cost Gallons (23 & 24)", "Percentage Cost Cases (23 & 24)", "Percentage Total Cost")) )# Ensure LFO is a factorsummary_full_data_long$LFO <-factor(summary_full_data_long$LFO, levels =c("0", "1"))# Plot for total costsggplot(summary_full_data_long, aes(x = LFO, y = value, fill = LFO)) +geom_bar(stat ="identity", position ="dodge", alpha =0.6) +geom_text(aes(label = scales::comma(value, prefix ="$")), position =position_dodge(width =0.8), vjust =0.2, size =3.5) +facet_wrap(~ metric, scales ="fixed", nrow =1) +labs(title ="Total Costs by Consumption Pattern") +scale_fill_manual(values =c("0"="#8ED081", "1"="#A7ADC6")) +theme_minimal() +theme(plot.title =element_text(size =10, face 
="bold"),axis.text.y =element_blank(),axis.title.x =element_blank(),axis.title.y =element_blank(),legend.position ="none",panel.grid.major =element_blank(),panel.grid.minor =element_blank(),strip.text =element_text(size =10, face ="bold"),strip.background =element_blank(),axis.text.x =element_text(size =10),panel.spacing =unit(1, "lines"),strip.text.y =element_blank(),axis.ticks.y =element_blank() ) +scale_x_discrete(labels =c("0"="Others", "1"="Local Fountain Only"))```In the years 2023 and 2024, the total delivery cost was 67.9 million, of which only 1.2 million was allocated to local market partners.```{r}# Group and calculate sums and percentages by LFOsummary_full_data <- full_data_customer %>%group_by(LOCAL_FOUNT_ONLY) %>%summarise(total_cost_gal =sum(COST_GAL_23, na.rm =TRUE) +sum(COST_GAL_24, na.rm =TRUE),total_cost_ca =sum(COST_CA_23, na.rm =TRUE) +sum(COST_CA_24, na.rm =TRUE),total_cost_all = total_cost_gal + total_cost_ca,.groups ="drop" ) %>%mutate(pct_cost_gal = total_cost_gal /sum(total_cost_gal) *100, pct_cost_ca = total_cost_ca /sum(total_cost_ca) *100, pct_total = total_cost_all /sum(total_cost_all) *100 ) %>%rename(LFO = LOCAL_FOUNT_ONLY)# Transform data into long format for percentagessummary_full_data_pct_long <- summary_full_data %>%pivot_longer(cols =starts_with("pct_"), names_to ="metric", values_to ="percentage" ) %>%mutate(metric =factor(metric, levels =c("pct_cost_gal", "pct_cost_ca", "pct_total"),labels =c("% Cost - Gallons", "% Cost - Cases", "% Total Cost")) )# Ensure LFO is a factorsummary_full_data_pct_long$LFO <-factor(summary_full_data_pct_long$LFO, levels =c("0", "1"))# Plot for percentagesggplot(summary_full_data_pct_long, aes(x = LFO, y = percentage, fill = LFO)) +geom_bar(stat ="identity", position ="dodge", alpha =0.6) +geom_text(aes(label = scales::comma(percentage, suffix ="%")), position =position_dodge(width =0.8), vjust =0.2, size =3.5) +facet_wrap(~ metric, scales ="fixed", nrow =1) +labs(title ="Percentage Costs by 
Consumption Pattern") +scale_fill_manual(values =c("0"="#8ED081", "1"="#A7ADC6")) +theme_minimal() +theme(plot.title =element_text(size =10, face ="bold"),axis.text.y =element_blank(),axis.title.x =element_blank(),axis.title.y =element_blank(),legend.position ="none",panel.grid.major =element_blank(),panel.grid.minor =element_blank(),strip.text =element_text(size =10, face ="bold"),strip.background =element_blank(),axis.text.x =element_text(size =10),panel.spacing =unit(1, "lines"),strip.text.y =element_blank(),axis.ticks.y =element_blank() ) +scale_x_discrete(labels =c("0"="Others", "1"="Local Fountain Only"))```Thus, in 2023 and 2024, the local partners who consume only fountain accounted for 1.8% of the total delivery costs. When we look at their share specifically in gallon deliveries, their participation rises to 7.3%.### 4.3 Customers HistoryBelow is the chart showing the density of customers in relation to the start of their partnership and their first delivery.```{r}# Gather the data for ON_BOARDING_DATE and FIRST_DELIVERY_DATE, filtering out 2025 dataprofile_data_long <- profile_data %>%filter(!is.na(ON_BOARDING_DATE) &!is.na(FIRST_DELIVERY_DATE)) %>%# Filter out 2025 data to avoid showing deliveries in that yearfilter(format(FIRST_DELIVERY_DATE, "%Y") !="2025") %>%pivot_longer(cols =c(ON_BOARDING_DATE, FIRST_DELIVERY_DATE), names_to ="Event", values_to ="Date") %>%# Set factor levels to ensure ON_BOARDING_DATE appears first in the plotmutate(Event =factor(Event, levels =c("ON_BOARDING_DATE", "FIRST_DELIVERY_DATE")))# Create density plots with facet_wrapggplot(profile_data_long, aes(x = Date, fill = Event, color = Event)) +geom_density(alpha =0.5) +# Adjust transparency for better visualizationfacet_wrap(~ Event, scales ="free", ncol =2) +# Create facets for each variablelabs(title ="Density Plots of Onboarding and First Delivery Dates",x ="Date",y ="Density") +scale_fill_manual(values =c("steelblue", "orange")) +# Set custom colors (first delivery = 
orange)scale_color_manual(values =c("steelblue", "orange")) +scale_y_continuous(labels = scales::label_number()) +# Remove scientific notation on Y axistheme_minimal() +theme(legend.position ="none") # Remove the legend for a cleaner plot```The vast majority of customers started to appear after 2010. The figures for the first deliveries show that, since 2016, at least 2,000 customers have received their first delivery each year. There were peaks in 2016 and 2017. In 2024, there was a decrease in the number of customers receiving their first delivery compared to 2023.```{r, warning=FALSE}# Reshape data: Gather ON_BOARDING_DATE and FIRST_DELIVERY_DATEprofile_data_long <- profile_data %>%filter(!is.na(ON_BOARDING_DATE) &!is.na(FIRST_DELIVERY_DATE)) %>%pivot_longer(cols =c(ON_BOARDING_DATE, FIRST_DELIVERY_DATE), names_to ="Event", values_to ="Date")# Set factor levels to ensure ON_BOARDING_DATE appears first in the plotprofile_data_long$Event <-factor(profile_data_long$Event, levels =c("ON_BOARDING_DATE", "FIRST_DELIVERY_DATE"))# Ensure Date is in Date formatprofile_data_long$Date <-as.Date(profile_data_long$Date)# Create histograms with yearly aggregationggplot(profile_data_long, aes(x = Date, fill = Event)) +geom_histogram(binwidth =365, color ="black", alpha =0.5, position ="identity") +facet_wrap(~ Event, scales ="free_x", ncol =2) +# Free scaling for X axislabs(title ="Distribution of Customer Onboarding and First Delivery Dates",x ="Date",y ="Count") +scale_fill_manual(values =c("steelblue", "orange")) +# Custom colors for eventsscale_x_date(labels = scales::date_format("%Y"), expand =c(0.01, 0.01)) +# Show only year on X-axisscale_y_continuous(labels = scales::label_number()) +# Ensure the y-axis is not in scientific notationtheme_minimal() +theme(legend.position ="none", # Remove legendaxis.text.x =element_text(hjust =-0.1), axis.ticks.x =element_blank(), panel.grid.major.x =element_blank(), # Remove vertical gridlinespanel.grid.minor.x =element_blank()) # 
Remove minor vertical gridlines```### 4.4 Order TypesThe way orders are placed and by whom is important for understanding customer growth potential. Most customer profiles are associated with sales representatives (65.7%). Other methods follow with 17.6%, and MyCoke 360 accounts for nearly 8%, despite only being launched in Summer 2024 to replace MyCoke Legacy.However, when analyzing actual transactions from 2023 and 2024, the distribution of order types differs significantly from the customer profiles. For example, sales representatives were responsible for only 27.5% of the orders, not 65.7% as listed in the profiles. Therefore, the analysis will be based on actual transactions rather than profile data.```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}# Summary data by FREQUENT_ORDER_TYPEdata_summary_order_type <- full_data_customer %>%group_by(FREQUENT_ORDER_TYPE) %>%summarise(Count =n()) %>%mutate(Percentage =round(Count /sum(Count) *100, 1))# Define the custom color palette (Neutral colors from RColorBrewer's "Set3")custom_palette_type <-brewer.pal(6, "Set3") # A 6-color palette from Set3# Create the horizontal bar chart with percentagesggplot(data_summary_order_type, aes(x = Count, y =reorder(FREQUENT_ORDER_TYPE, Count), fill = FREQUENT_ORDER_TYPE)) +geom_bar(stat ="identity", position ="stack", alpha =0.6) +geom_text(aes(label =paste(Percentage, "%")), position =position_stack(vjust =0.5), hjust =-0.01, color ="black", size =3.2) +labs(title ="Percentage of Customers by Frequent Order Type", x =NULL, y =NULL) +scale_x_continuous(labels =NULL, expand =expansion(c(0, 0.05))) +scale_fill_manual(values = custom_palette_type) +theme_minimal() +theme(plot.title =element_text(size =10, face ="bold")) +theme(axis.text.y =element_text(size =10), axis.title.x =element_blank(), legend.position ="none", panel.grid.major =element_blank(), panel.grid.minor =element_blank()) ```Below are the percentages cases ordered in 2023 and 2024 by order 
type for each transaction placed in 2023 and 2024.

```{r}
# Define the custom color palette (neutral colors from RColorBrewer's "Set3")
custom_palette_type <- brewer.pal(6, "Set3")  # A 6-color palette from Set3

# Summarize data by ORDER_TYPE, summing ORDERED_CASES
data_summary_order_type <- full_data %>%
  group_by(ORDER_TYPE) %>%
  summarise(OrderedCasesSum = sum(ORDERED_CASES, na.rm = TRUE), .groups = 'drop') %>%
  mutate(Percentage = round(OrderedCasesSum / sum(OrderedCasesSum) * 100, 1),
         Percentage = ifelse(Percentage < 0.15, NA, Percentage))  # Set values less than 0.15% to NA (not displayed)

# Create the horizontal bar chart with percentages, with no aggregation by LOCAL_FOUNT_ONLY
ggplot(data_summary_order_type, aes(x = OrderedCasesSum, y = reorder(ORDER_TYPE, OrderedCasesSum), fill = ORDER_TYPE)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +
  geom_text(aes(label = ifelse(!is.na(Percentage), paste(Percentage, "%"), "")),  # Only display text if Percentage is not NA
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Percentage of Ordered Case Volumes by Order Type (23 & 24)", x = NULL, y = NULL) +
  scale_x_continuous(labels = NULL, expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = custom_palette_type) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
```

Among the cases ordered in 2023 and 2024, the largest share of transactions was placed through digital channels (MyCoke Legacy and MyCoke 360), accounting for 35.5%. This was followed by sales representatives with 25.4% and call centers with 15.5%.
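The profile-versus-transaction gap described above can be audited directly by computing both shares and joining them. A minimal sketch with toy data standing in for `full_data_customer` and `full_data` (only the column names `FREQUENT_ORDER_TYPE` and `ORDER_TYPE` come from the report; the values are illustrative):

```{r}
library(dplyr)

# Toy stand-ins for full_data_customer (profiles) and full_data (transactions)
profiles <- tibble::tibble(
  FREQUENT_ORDER_TYPE = c("SALES REP", "SALES REP", "MYCOKE360", "OTHER")
)
transactions <- tibble::tibble(
  ORDER_TYPE = c("SALES REP", "MYCOKE360", "MYCOKE360", "CALL CENTER")
)

# Percentage share of each order type within a data frame
pct_share <- function(df, col) {
  df %>%
    count(.data[[col]], name = "Count") %>%
    mutate(Percentage = round(Count / sum(Count) * 100, 1)) %>%
    rename(TYPE = 1)
}

# Side-by-side comparison; NA marks types present in only one source
comparison <- full_join(
  pct_share(profiles, "FREQUENT_ORDER_TYPE"),
  pct_share(transactions, "ORDER_TYPE"),
  by = "TYPE", suffix = c("_profile", "_transaction")
)
comparison
```

A join like this makes discrepancies such as the 65.7% (profile) versus 27.5% (transaction) sales-rep share visible in a single table.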
MyCoke 360, which was recently launched, makes up 7.1% of the transactions.```{r}# Summarize data by ORDER_TYPE and LOCAL_FOUNT_ONLY, summing ORDERED_GALLONSdata_summary_order_type <- full_data %>%group_by(ORDER_TYPE, LOCAL_FOUNT_ONLY) %>%summarise(OrderedGallonsSum =sum(ORDERED_GALLONS, na.rm =TRUE), .groups ='drop') %>%mutate(Percentage =round(OrderedGallonsSum /sum(OrderedGallonsSum) *100, 1),Percentage =ifelse(Percentage <0.0, NA, Percentage)) # Set values less than 0.15% to NA (not displayed)# Create the horizontal bar chart with percentages, facet by LOCAL_FOUNT_ONLYggplot(data_summary_order_type, aes(x = OrderedGallonsSum, y =reorder(ORDER_TYPE, OrderedGallonsSum), fill = ORDER_TYPE)) +geom_bar(stat ="identity", position ="stack", alpha =0.5) +geom_text(aes(label =ifelse(!is.na(Percentage), paste(Percentage, "%"), "")), # Only display text if Percentage is not NAposition =position_stack(vjust =0.5), hjust =-0.01, color ="black", size =3.2) +labs(title ="% of Ordered Gallons by Order Type and Customer type (2023 & 2024)",x =NULL, y =NULL) +scale_x_continuous(labels =NULL, expand =expansion(c(0, 0.05))) +scale_fill_manual(values = custom_palette_type) +# Apply the custom color palette for ORDER_TYPEtheme_minimal() +theme(plot.title =element_text(size =10, face ="bold")) +theme(axis.text.y =element_text(size =10), axis.title.x =element_blank(), legend.position ="none", # Hide the legendpanel.grid.major =element_blank(), panel.grid.minor =element_blank(),strip.text =element_text(face ="bold", size =10), strip.background =element_blank()) +facet_wrap(~ LOCAL_FOUNT_ONLY, scales ="free_y", ncol =2, labeller =labeller(LOCAL_FOUNT_ONLY =c('0'='Others', '1'='Local Fountain Only'))) # Facet labels```For gallon orders, only a very small fraction (less than 6%) is represented by Local Market Partners that order Fountain Only. 
For these customers, the majority of their orders are placed via the call center (2.4%), followed by digital channels (2.2%), and finally sales reps (1.3%).For the remaining customers, digital channels represent 34.6% (MyCoke360 + Legacy), sales reps 32.5%, and call centers 24.5%.It can be said that digital channels are the most used, accounting for approximately 35% of the total volume of cases and gallons for all customers. Sales reps have a smaller proportional share for case orders but carry more weight for gallon orders.```{r, warning=FALSE}# Summarize data by ORDER_TYPE, summing DELIVERED_CASES and DELIVERED_GALLONSdata_summary_order_type <- full_data %>%group_by(ORDER_TYPE) %>%summarise(DeliveredCasesSum =sum(DELIVERED_CASES, na.rm =TRUE),DeliveredGallonsSum =sum(DELIVERED_GALLONS, na.rm =TRUE),.groups ='drop' ) %>%mutate(TotalVolume = DeliveredCasesSum + DeliveredGallonsSum,Percentage =round(TotalVolume /sum(TotalVolume) *100, 1) )# Create horizontal bar chart with both absolute volume and percentageggplot(data_summary_order_type, aes(x = TotalVolume, y =reorder(ORDER_TYPE, TotalVolume), fill = ORDER_TYPE)) +geom_bar(stat ="identity", alpha =0.5) +geom_text(aes(label =paste(scales::comma(TotalVolume, accuracy =1), paste0("(", Percentage, "%)"))),position =position_stack(vjust =0.5),hjust =-0.01,color ="black",size =3.2 ) +scale_x_continuous(labels = scales::comma,breaks =seq(0, max(data_summary_order_type$TotalVolume), by =5000000),expand =expansion(c(0, 0.05)) ) +scale_fill_manual(values = custom_palette_type) +labs(title ="Total Delivered Cases and Gallons by Order Type (23 & 24)",x ="Volume (units)",y =NULL ) +theme_minimal() +theme(plot.title =element_text(size =10, face ="bold"),axis.text.y =element_text(size =10),axis.title.x =element_text(size =10, face ="plain"),legend.position ="none",panel.grid.major =element_blank(),panel.grid.major.x =element_line(color ="lightgray", size =0.5),panel.grid.minor =element_blank() )```In line with the previous 
points, digital channels account for nearly 36% of the total volume delivered in 2023 and 2024, followed by sales reps at 27.5% and call centers at 18.5%.

```{r}
# Summarize data by ORDER_TYPE, summing DELIVERED_CASES and DELIVERED_GALLONS
# NOTE: this chunk repeats the delivered-volume aggregation above; the cost
# figures discussed in the text are not computed from these volume columns.
data_summary_order_type <- full_data %>%
  group_by(ORDER_TYPE) %>%
  summarise(DeliveredCasesSum = sum(DELIVERED_CASES, na.rm = TRUE),
            DeliveredGallonsSum = sum(DELIVERED_GALLONS, na.rm = TRUE),
            .groups = 'drop') %>%
  mutate(TotalVolume = DeliveredCasesSum + DeliveredGallonsSum,
         Percentage = round(TotalVolume / sum(TotalVolume) * 100, 1))

# Create horizontal bar chart with absolute volume and percentage
ggplot(data_summary_order_type, aes(x = TotalVolume, y = reorder(ORDER_TYPE, TotalVolume), fill = ORDER_TYPE)) +
  geom_bar(stat = "identity", alpha = 0.5) +
  geom_text(aes(label = paste(scales::comma(TotalVolume, accuracy = 1), paste0("(", Percentage, "%)"))),
            position = position_stack(vjust = 0.5), hjust = -0.01, color = "black", size = 3.2) +
  scale_x_continuous(labels = scales::comma,
                     breaks = seq(0, max(data_summary_order_type$TotalVolume), by = 5000000),
                     expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = custom_palette_type) +
  labs(title = "Total Delivered Cases and Gallons by Order Type (23 & 24)",
       x = "Volume (units)", y = NULL) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_text(size = 10, face = "plain"),
        legend.position = "none",
        panel.grid.major = element_blank(),
        panel.grid.major.x = element_line(color = "lightgray", linewidth = 0.5),
        panel.grid.minor = element_blank())
```

Digital channels account for the largest share of delivery costs, representing 40% of the total delivered cost.
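One way to make the cost/volume relationship explicit is to compute cost per delivered unit by order type. A sketch under the assumption of a transaction-level cost column (`DELIVERY_COST` is hypothetical; only `ORDER_TYPE`, `DELIVERED_CASES`, and `DELIVERED_GALLONS` appear in the report's data):

```{r}
library(dplyr)

# Toy transactions; DELIVERY_COST is a hypothetical cost column
toy <- tibble::tibble(
  ORDER_TYPE        = c("SALES REP", "SALES REP", "CALL CENTER"),
  DELIVERED_CASES   = c(800, 700, 300),
  DELIVERED_GALLONS = c(200, 300, 100),
  DELIVERY_COST     = c(1500, 1500, 900)
)

# Cost per delivered unit (cases + gallons) by order type
unit_cost <- toy %>%
  group_by(ORDER_TYPE) %>%
  summarise(
    TotalVolume = sum(DELIVERED_CASES + DELIVERED_GALLONS, na.rm = TRUE),
    TotalCost   = sum(DELIVERY_COST, na.rm = TRUE),
    CostPerUnit = TotalCost / TotalVolume,
    .groups = "drop"
  )
unit_cost
```

In this toy example the call-center channel moves less than a quarter of the sales-rep volume yet shows a higher cost per unit, which is the pattern the cost shares above suggest for the real data.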
Notably, call center costs are slightly higher than sales rep costs even though call centers move less volume, which points to a higher per-unit delivery cost for call-center orders.

```{r}
# Summarize by ORDER_TYPE and FLEET_TYPE using delivered volume
data_summary_fleet_by_order <- full_data %>%
  filter(!is.na(FLEET_TYPE), !is.na(ORDER_TYPE)) %>%
  group_by(ORDER_TYPE, FLEET_TYPE) %>%
  summarise(TotalDelivered = sum(DELIVERED_CASES + DELIVERED_GALLONS, na.rm = TRUE), .groups = "drop") %>%
  group_by(ORDER_TYPE) %>%
  mutate(Percentage = round(TotalDelivered / sum(TotalDelivered) * 100, 0))

# Order ORDER_TYPE by total delivered volume
order_levels <- data_summary_fleet_by_order %>%
  group_by(ORDER_TYPE) %>%
  summarise(Total = sum(TotalDelivered), .groups = "drop") %>%
  arrange(Total) %>%
  pull(ORDER_TYPE)

# Reorder as factor
data_summary_fleet_by_order$ORDER_TYPE <- factor(data_summary_fleet_by_order$ORDER_TYPE, levels = order_levels)

# Plot
ggplot(data_summary_fleet_by_order, aes(x = TotalDelivered, y = ORDER_TYPE, fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.6) +
  geom_text(aes(label = paste0(Percentage, "%")), position = position_stack(vjust = 0.5), hjust = 0, color = "black", size = 3.2) +
  labs(title = "400-Gallon Threshold: Delivered Volume by Order Type", x = "Volume (units)", y = NULL, fill = "Fleet Type") +
  scale_x_continuous(labels = function(x) paste0(x / 1e6, "M"),
                     breaks = c(2500000, 5000000, 7500000, 10000000),
                     expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_text(size = 10, face = "plain"),
        legend.position = "right",
        legend.direction = "vertical",
        panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_line(color = "lightgray", linewidth = 0.5),
        panel.grid.minor = element_blank())
```

Sales Rep had the highest internal share (62%) that would be served by red trucks if the 400-gallon threshold were applied.
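The 400-gallon rule behind the chart above can be expressed as a small helper. A sketch, assuming the threshold applies to the average annual gallons over 2023–2024 (the exact definition is not restated in this section; the helper name and toy data are illustrative, while the column names mirror `full_data_customer`):

```{r}
library(dplyr)

# Hypothetical helper: classify customers by an annual gallon threshold
assign_fleet <- function(df, threshold = 400) {
  df %>%
    mutate(
      ANNUAL_GALLONS = (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024) / 2,  # two-year average (assumption)
      FLEET_TYPE = ifelse(ANNUAL_GALLONS >= threshold, "RED TRUCK", "WHITE TRUCK")
    )
}

# Toy customers using the report's column names
toy_customers <- tibble::tibble(
  CUSTOMER_NUMBER  = 1:3,
  QTD_DLV_GAL_2023 = c(500, 100, 900),
  QTD_DLV_GAL_2024 = c(700, 150, 0)
)

result <- assign_fleet(toy_customers)
result$FLEET_TYPE
```

Keeping the rule in one function makes it easy to re-run the whole fleet split under alternative thresholds (e.g. 300 or 500 gallons) when testing sensitivity.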
On the other hand, Call Center showed the highest percentage of customers who would be served by white trucks.### 4.5 Channel Types```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}# Summarize data by COLD_DRINK_CHANNEL and LOCAL_FOUNT_ONLY#data_summary_cold_drink_channel <- full_data %>%# group_by(COLD_DRINK_CHANNEL, LOCAL_FOUNT_ONLY) %>%# summarise(Count = n(), .groups = 'drop') %>%# mutate(Percentage = round(Count / sum(Count) * 100, 1))# Print the summary table#data_summary_cold_drink_channel# Define the custom color palette for COLD_DRINK_CHANNEL with unique colorscold_drink_channel_colors <-c("DINING"="#A7ADC6", "PUBLIC SECTOR"="#FF6347", "EVENT"="#B33951", "WORKPLACE"="#ABD2FA", "ACCOMMODATION"="#E377C2", "GOODS"="#FFD700", "BULK TRADE"="#8ED081", "WELLNESS"="#20B2AA", "CONVENTIONAL"="#1F77B4")```More than 50% of transactions were made through the DINING channel, followed by GOODS (16.6%), EVENTS (9.2%), and BULK TRADE (8.4%). The remaining channels each represent less than 5% of the total.Transactions for Local Partners Fountain Only are almost entirely concentrated in DINING, with 2.7% of transactions compared to 47.8% for other channels.```{r}# Calculate the frequency of each COLD_DRINK_CHANNELdata_summary_cold_drink_channel <- full_data %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(Count =n(), .groups ='drop') %>%mutate(Percentage =round(Count /sum(Count) *100, 1))# Create a horizontal bar chart with percentages for COLD_DRINK_CHANNELggplot(data_summary_cold_drink_channel, aes(x = Count, y =reorder(COLD_DRINK_CHANNEL, Count), fill = COLD_DRINK_CHANNEL)) +geom_bar(stat ="identity", position ="stack", alpha =0.5) +geom_text(aes(label =ifelse(!is.na(Percentage), paste(Percentage, "%"), "")), # Only display text if Percentage is not NAposition =position_stack(vjust =0.5), hjust =-0.01, color ="black", size =3.2) +labs(title ="Percentage of Transactions by Cold Drink Channel",x =NULL, y =NULL) +scale_x_continuous(labels =NULL, 
expand =expansion(c(0, 0.05))) +scale_fill_manual(values = cold_drink_channel_colors) +# Use your custom color palettetheme_minimal() +theme(plot.title =element_text(size =10, face ="bold")) +theme(axis.text.y =element_text(size =10), axis.title.x =element_blank(), legend.position ="none", # Remove the legendpanel.grid.major =element_blank(), panel.grid.minor =element_blank())# Summarize data by COLD_DRINK_CHANNEL, summing the quantities of gallons and casesdata_summary <- full_data_customer %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(Total_Volume =sum(QTD_DLV_GAL_2023, na.rm =TRUE) +sum(QTD_DLV_GAL_2024, na.rm =TRUE) +sum(QTD_DLV_CA_2023, na.rm =TRUE) +sum(QTD_DLV_CA_2024, na.rm =TRUE),.groups ='drop' ) %>%mutate(Percentage =round(Total_Volume /sum(Total_Volume) *100, 1)) # Calculate the percentage# Create a horizontal bar chart for the percentage of total volume by cold drink channelggplot(data_summary, aes(x = Total_Volume /1e6, y =reorder(COLD_DRINK_CHANNEL, Total_Volume), fill = COLD_DRINK_CHANNEL)) +geom_bar(stat ="identity", position ="stack", alpha =0.7) +geom_text(aes(label =paste(Percentage, "%")), position =position_stack(vjust =0.5), hjust =-0.01, color ="black", size =3.2) +labs(title ="Percentage of Total Volume (Gallons and Cases) by Cold Drink Channel",x ="Quantity in Millions", y =NULL) +scale_x_continuous(labels =function(x) paste0(x, "M"),breaks =seq(0, 10, by =2.5),expand =expansion(c(0, 0.05)) ) +geom_vline(xintercept =c(2.5, 5, 7.5, 10), color ="lightgray", linetype ="solid", linewidth =0.3) +scale_fill_manual(values = cold_drink_channel_colors) +theme_minimal() +theme(plot.title =element_text(size =10, face ="bold"), axis.text.y =element_text(size =10), axis.text.x =element_text(size =10), axis.title.x =element_text(size =10, face ="bold"),legend.position ="none", panel.grid.major =element_blank(), panel.grid.minor =element_blank() )```Dining was the segment with the highest total consumption, accounting for 27% of the total, followed by 
Bulk Trade with 25.8% and Workplace with 13.4%. The following section analyzes the information separately by packaging type (cases and gallons) and customer type.```{r}# Summarize data by COLD_DRINK_CHANNEL and FLEET_TYPE excluding "CONVENTIONAL"data_summary <- full_data_customer %>%filter(COLD_DRINK_CHANNEL !="CONVENTIONAL") %>%# Exclude "CONVENTIONAL" channelgroup_by(COLD_DRINK_CHANNEL, FLEET_TYPE) %>%summarise(Total_Volume =sum(QTD_DLV_GAL_2023, na.rm =TRUE) +sum(QTD_DLV_GAL_2024, na.rm =TRUE) +sum(QTD_DLV_CA_2023, na.rm =TRUE) +sum(QTD_DLV_CA_2024, na.rm =TRUE),.groups ='drop' ) %>%# Calculate the percentage of each Fleet Type within each Cold Drink Channelgroup_by(COLD_DRINK_CHANNEL) %>%mutate(Percentage = Total_Volume /sum(Total_Volume) *100) %>%ungroup()# Create the horizontal bar plotggplot(data_summary, aes(x = Total_Volume, y =reorder(COLD_DRINK_CHANNEL, Total_Volume), fill = FLEET_TYPE)) +geom_bar(stat ="identity", position ="stack", alpha =0.7) +geom_text(aes(label =paste0(round(Percentage), "%")), position =position_stack(vjust =0.5), hjust =0.4, color ="black", size =3.2) +# Round percentages and remove decimal placeslabs(title ="400 gallons Threshold - Total Volume by Cold Drink Channel",x ="Total Volume (in Millions)", y =NULL) +scale_x_continuous(labels = scales::comma_format(scale =1e-6, suffix ="M"), # Convert axis to millionsbreaks =seq(2500000, 10000000, by =2500000)) +# Define custom x-axis breaksscale_fill_manual(values =c("RED TRUCK"="#B33951", "WHITE TRUCK"="#D3D3D3")) +# Custom color palettetheme_minimal() +theme(plot.title =element_text(size =10, face ="bold")) +theme(axis.text.y =element_text(size =10), axis.title.x =element_text(size =10), # X-axis title sizeaxis.text.x =element_text(size =10), # X-axis text sizelegend.position ="bottom", # Position legend below the plotlegend.box ="horizontal", # Display legend items horizontallypanel.grid.major =element_blank(), panel.grid.minor =element_blank()) +# Add vertical lines at specific 
breaks on the x-axisgeom_vline(xintercept =c(2500000, 5000000, 7500000, 10000000), color ="gray", linetype ="solid", size =0.5)# Summarize data by COLD_DRINK_CHANNEL and FLEET_TYPEdata_summary <- full_data_customer %>%group_by(COLD_DRINK_CHANNEL, FLEET_TYPE) %>%summarise(Total_Volume =sum(QTD_DLV_GAL_2023, na.rm =TRUE) +sum(QTD_DLV_GAL_2024, na.rm =TRUE) +sum(QTD_DLV_CA_2023, na.rm =TRUE) +sum(QTD_DLV_CA_2024, na.rm =TRUE),.groups ='drop' ) %>%# Calculate the percentage of each Fleet Type within each Cold Drink Channelgroup_by(COLD_DRINK_CHANNEL) %>%mutate(Percentage = Total_Volume /sum(Total_Volume) *100) %>%ungroup()# Create the table#kable(data_summary, format = "markdown", digits = 1, caption = "Total Volume and Percentage by Cold Drink Channel and Fleet Type")```Above are the percentage representations of the volume that would be served by red and white trucks for the 400-gallon threshold. The majority of the volumes would be delivered by red trucks. The "CONVENTIONAL" segment was not displayed due to its extremely low volume, which would overlap with the labels. 
In this segment, the proportion is 47% for white trucks and 53% for red trucks.#### 4.5.1 Cold Drink Channel - Delivered Cases for All CustomersBelow are the percentages of cases delivered in 2023 and 2024 for all customers by cold drink channel.```{r}# Summarize data by COLD_DRINK_CHANNEL, summing the quantities of cases (QTD_DLV_CA_2023 and QTD_DLV_CA_2024)data_summary_cases <- full_data_customer %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(Total_Cases =sum(QTD_DLV_CA_2023, na.rm =TRUE) +sum(QTD_DLV_CA_2024, na.rm =TRUE),.groups ='drop' ) %>%mutate(Percentage =round(Total_Cases /sum(Total_Cases) *100, 1)) # Calculate the percentage# Create a bar chart for the percentage of total cases by cold drink channelggplot(data_summary_cases, aes(x = Total_Cases, y =reorder(COLD_DRINK_CHANNEL, Total_Cases), fill = COLD_DRINK_CHANNEL)) +geom_bar(stat ="identity", position ="stack", alpha =0.5) +geom_text(aes(label =paste(Percentage, "%")), position =position_stack(vjust =0.5), hjust =-0.01, color ="black", size =3.2) +labs(title ="All Customers - Percentage of Cases (23 & 24) by Cold Drink Channel",x ="Percentage of Total Cases", y =NULL) +scale_x_continuous(labels = scales::percent_format(scale =1), expand =expansion(c(0, 0.05))) +scale_fill_manual(values = cold_drink_channel_colors) +# Apply the custom color palettetheme_minimal() +theme(plot.title =element_text(size =10, face ="bold")) +theme(axis.text.y =element_text(size =10), axis.title.x =element_blank(), # Remove the x-axis titleaxis.text.x =element_blank(), # Remove the x-axis textlegend.position ="none", # Remove the legendpanel.grid.major =element_blank(), panel.grid.minor =element_blank())# Summarize data by COLD_DRINK_CHANNEL, summing the delivery cost for cases (COST_CA_23 and COST_CA_24)data_summary_cases_cost <- full_data_customer %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(Total_Cases_Cost =sum(COST_CA_23, na.rm =TRUE) +sum(COST_CA_24, na.rm =TRUE),.groups ='drop' ) %>%mutate(Percentage 
=round(Total_Cases_Cost /sum(Total_Cases_Cost) *100, 1)) # Calculate the percentage# Create a bar chart for the percentage of total cases cost by cold drink channelggplot(data_summary_cases_cost, aes(x = Total_Cases_Cost, y =reorder(COLD_DRINK_CHANNEL, Total_Cases_Cost), fill = COLD_DRINK_CHANNEL)) +geom_bar(stat ="identity", position ="stack", alpha =0.5) +geom_text(aes(label =paste(Percentage, "%")), position =position_stack(vjust =0.5), hjust =-0.01, color ="black", size =3.2) +labs(title ="All Customers - Percentage of Cases Delivery Cost (23 & 24) by Cold Drink Channel",x ="Percentage of Total Cases Cost", y =NULL) +scale_x_continuous(labels = scales::percent_format(scale =1), expand =expansion(c(0, 0.05))) +scale_fill_manual(values = cold_drink_channel_colors) +# Apply the custom color palettetheme_minimal() +theme(plot.title =element_text(size =10, face ="bold")) +theme(axis.text.y =element_text(size =10), axis.title.x =element_blank(), # Remove the x-axis titleaxis.text.x =element_blank(), # Remove the x-axis textlegend.position ="none", # Remove the legendpanel.grid.major =element_blank(), panel.grid.minor =element_blank())```The main segment receiving cases (bottles, cans, etc.) was Bulk Trade with 33%, followed by Workplace with 17%, and Dining with 14.6%. 
On the other hand, the segment that presented the highest delivery costs for cases was Dining, accounting for 34% of the cost in 2023 and 2024, followed by Goods at 21%, and Bulk Trade at 16%.The tables below aim to provide detailed information within this group regarding the number of customers, costs, quartile divisions, and other relevant factors.```{r}# Calculate Total Cases, COST_CA, N.Customers, Perct.Customers, AVG_Qtd, and Median.Qtd, then order and format the tablefull_data_customer %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(Total_Cases =sum(QTD_DLV_CA_2023 + QTD_DLV_CA_2024),COST_CA =sum(COST_CA_23) +sum(COST_CA_24),# Count only customers where Total_Cases > 0N_Customers =n_distinct(CUSTOMER_NUMBER[QTD_DLV_CA_2023 + QTD_DLV_CA_2024 >0]), # Calculate the total cases per customer, excluding customers with zero total casesTotal_Cases_Per_Customer =list(QTD_DLV_CA_2023 + QTD_DLV_CA_2024),Total_Cost_Per_Customer =list(COST_CA_23 + COST_CA_24),.groups ='drop' ) %>%mutate(# Calculate the average COST_CA per Total_CasesAVG_Cost_CA = COST_CA / Total_Cases, # Calculate the percentage of total casesPERCT_CASE =round(Total_Cases /sum(Total_Cases) *100, 1),# Calculate the percentage of total customersPerct_Customers =round(N_Customers /sum(N_Customers) *100, 1), # Calculate percentage of customers# Calculate the average cases per customer (without decimals)AVG_Qtd =round(Total_Cases / N_Customers), # No decimals for AVG_Qtd# Calculate the median of cases per customer, excluding customers with zero casesMedian_Qtd =sapply(Total_Cases_Per_Customer, function(x) {median(x[x >0], na.rm =TRUE) # Only consider positive cases for the median }),# Calculate the median cost per case for each cold drink channel, excluding customers with zero casesMedian_Cost =sapply(1:length(Total_Cases_Per_Customer), function(i) { total_cost <- Total_Cost_Per_Customer[[i]] total_cases <- Total_Cases_Per_Customer[[i]]median(total_cost[total_cases >0] / total_cases[total_cases >0], na.rm =TRUE) 
# Median cost per case }) ) %>%# Order by Total Cases in descending order (before formatting)arrange(desc(Total_Cases)) %>%# Calculate Opt_Cost for each COLD_DRINK_CHANNEL based on minimum median delivery cost for CASESleft_join( cost_data %>%filter(grepl("CASES", as.character(`RANGE_LEVEL`))) %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(Opt_Cost =round(min(`Median Delivery Cost`), 2)) %>%ungroup(), # Ensures only 1 line per COLD_DRINK_CHANNELby ="COLD_DRINK_CHANNEL" ) %>%# Format COST_CA, Total_Cases, AVG_Cost_CA, N_Customers, Perct_Customers, AVG_Qtd, Median.Qtd, and Median.Cost after orderingmutate(COST_CA = scales::comma(COST_CA), Total_Cases = scales::comma(Total_Cases), AVG_Cost_CA = scales::comma(AVG_Cost_CA, accuracy =0.01),N_Customers = scales::comma(N_Customers), # Format N_CustomersPERCT_CASE =sprintf("%.1f", PERCT_CASE), # Ensure 1 decimal place for percentagePerct_Customers =sprintf("%.1f", Perct_Customers), # Ensure 1 decimal place for percentageAVG_Qtd = scales::comma(AVG_Qtd), # Format AVG_QtdMedian_Qtd = scales::comma(Median_Qtd), # Format Median_QtdMedian_Cost = scales::comma(Median_Cost, accuracy =0.01), # Format Median_CostOpt_Cost = scales::comma(Opt_Cost, accuracy =0.01) # Format Opt_Cost ) %>%# Select columns in the correct order with exact column names dplyr::select( COLD_DRINK_CHANNEL, Total_Cases, PERCT_CASE, COST_CA, N_Customers, Perct_Customers, AVG_Qtd, Median_Qtd, AVG_Cost_CA, Median_Cost, Opt_Cost ) %>%# Rename columns to match the desired outputrename(`Channel`= COLD_DRINK_CHANNEL,`T.Cases`= Total_Cases,`Cases %`= PERCT_CASE,`T.Cost $`= COST_CA,`N.Cust`= N_Customers,`P.Cust %`= Perct_Customers,`Avg.Qtd.Cust`= AVG_Qtd,`Median.Qtd.Cust`= Median_Qtd,`Avg.Cost.Cust $`= AVG_Cost_CA,`Med.Cost.Cust $`= Median_Cost,`Opt.Cost $`= Opt_Cost ) %>%kable("html", escape =FALSE, align ="c", col.names =c("Channel", "T.Cases", "Cases %", "T.Cost $", "N.Cust", "P.Cust %", "Avg.Qtd.Cust", "Median.Qtd.Cust", "Avg.Cost.Cust $", "Med.Cost.Cust $", 
"Opt.Cost $")) %>%kable_styling(full_width =FALSE, position ="center") %>%column_spec(1, bold =TRUE) %>%column_spec(2:11, width ="6em") %>%row_spec(0, bold =TRUE, color ="black", background ="#ADD8E6") %>%add_header_above(c("CASES (23 & 24) - Deliveries by Cold Drink Channel - All Customers"=11)) %>%kable_paper("striped", full_width =FALSE)############# Calculate Quartiles, Customer Count, and Volume Distributionfull_data_customer %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(# Store the total cases per customer, excluding zero valuesCases_Per_Customer =list(QTD_DLV_CA_2023 + QTD_DLV_CA_2024),.groups ='drop' ) %>%mutate(# Calculate the average and median cases per customer`Avg.Qtd.Cust`=sapply(Cases_Per_Customer, function(x) mean(x[x >0])),`Median.Qtd.Cust`=sapply(Cases_Per_Customer, function(x) median(x[x >0])),# Compute quartiles for quantity`1Quart.Qtd`=sapply(Cases_Per_Customer, function(x) quantile(x[x >0], 0.25)),`2Quart.Qtd`=sapply(Cases_Per_Customer, function(x) quantile(x[x >0], 0.50)), # Median (Q2)`3Quart.Qtd`=sapply(Cases_Per_Customer, function(x) quantile(x[x >0], 0.75)) ) %>%rowwise() %>%# Ensure calculations are row-wise based on quartile valuesmutate(# Extract case values from the listCase_Values =list(unlist(Cases_Per_Customer)),# Calculate total cases volume per quartile using the correct conditions`1Quart.Vol`=sum(Case_Values[which(Case_Values >0& Case_Values <=`1Quart.Qtd`)]),`2Quart.Vol`=sum(Case_Values[which(Case_Values >`1Quart.Qtd`& Case_Values <=`2Quart.Qtd`)]),`3Quart.Vol`=sum(Case_Values[which(Case_Values >`2Quart.Qtd`& Case_Values <=`3Quart.Qtd`)]),`4Quart.Vol`=sum(Case_Values[which(Case_Values >`3Quart.Qtd`)]),# Calculate the total volume for the quartiles (1 to 4) in each channelTotal_Vol =`1Quart.Vol`+`2Quart.Vol`+`3Quart.Vol`+`4Quart.Vol`,# Calculate percentages based on the sum of volumes from all quartiles for each channel`1Q.Vol%`=round((`1Quart.Vol`/ Total_Vol) *100, 1),`2Q.Vol%`=round((`2Quart.Vol`/ Total_Vol) *100, 
1),`3Q.Vol%`=round((`3Quart.Vol`/ Total_Vol) *100, 1),`4Q.Vol%`=round((`4Quart.Vol`/ Total_Vol) *100, 1) ) %>%ungroup() %>%# Remove row-wise grouping# Order by Avg.Qtd.Cust in descending orderarrange(desc(`Avg.Qtd.Cust`)) %>%# Format numbers for readabilitymutate(`Avg.Qtd.Cust`= scales::comma(`Avg.Qtd.Cust`, accuracy =1),`Median.Qtd.Cust`= scales::comma(`Median.Qtd.Cust`, accuracy =1),`1Quart.Qtd`= scales::comma(`1Quart.Qtd`, accuracy =1),`2Quart.Qtd`= scales::comma(`2Quart.Qtd`, accuracy =1),`3Quart.Qtd`= scales::comma(`3Quart.Qtd`, accuracy =1),`1Quart.Vol`= scales::comma(`1Quart.Vol`, accuracy =1),`2Quart.Vol`= scales::comma(`2Quart.Vol`, accuracy =1),`3Quart.Vol`= scales::comma(`3Quart.Vol`, accuracy =1),`4Quart.Vol`= scales::comma(`4Quart.Vol`, accuracy =1),`1Q.Vol%`=formatC(`1Q.Vol%`, format ="f", digits =1),`2Q.Vol%`=formatC(`2Q.Vol%`, format ="f", digits =1),`3Q.Vol%`=formatC(`3Q.Vol%`, format ="f", digits =1),`4Q.Vol%`=formatC(`4Q.Vol%`, format ="f", digits =1) ) %>%# Select only required columns dplyr::select( COLD_DRINK_CHANNEL, `Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`, `1Quart.Vol`, `1Q.Vol%`, `2Quart.Vol`, `2Q.Vol%`, `3Quart.Vol`, `3Q.Vol%`, `4Quart.Vol`, `4Q.Vol%` ) %>%# Rename columnsrename(`Channel`= COLD_DRINK_CHANNEL ) %>%kable("html", escape =FALSE, align ="c", col.names =c("Channel", "Avg.Qtd.Cust", "Median.Qtd.Cust", "1Quart.Qtd", "2Quart.Qtd", "3Quart.Qtd", "1Quart.Vol", "1Q.Vol%", "2Quart.Vol", "2Q.Vol%", "3Quart.Vol", "3Q.Vol%", "4Quart.Vol", "4Q.Vol%")) %>%kable_styling(full_width =FALSE, position ="center") %>%column_spec(1, bold =TRUE) %>%column_spec(2:14, width ="6em") %>%row_spec(0, bold =TRUE, color ="black", background ="#ADD8E6") %>%add_header_above(c("CASES (23 & 24) - Quartile Analysis by Cold Drink Channel - All Customers"=14)) %>%kable_paper("striped", full_width =FALSE)```The tables above can be used for different analyses, which will not be discussed here. 
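The quartile tables above rest on one recurring computation: split per-customer totals at Q1/Q2/Q3 and report each band's share of total volume. A self-contained sketch of that logic in base R (the input vector stands in for the per-customer totals `QTD_DLV_CA_2023 + QTD_DLV_CA_2024`):

```{r}
# Share of total volume contributed by each customer quartile band
quartile_volume_share <- function(cases) {
  cases <- cases[cases > 0]                      # exclude zero-volume customers, as in the tables
  q <- quantile(cases, c(0.25, 0.50, 0.75))
  bands <- cut(cases, breaks = c(0, q, Inf), labels = c("Q1", "Q2", "Q3", "Q4"))
  vol <- tapply(cases, bands, sum, default = 0)  # total cases per band
  round(100 * vol / sum(vol), 1)                 # percentage of volume per band
}

# A single large account dominates the fourth band's volume share
quartile_volume_share(c(10, 20, 30, 40, 1000))
```

This is also why a channel's mean can sit far above its median: a few large accounts concentrate most of the volume in the fourth band.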
It is worth highlighting that the bulk trade sector has a high number of outliers, which cause its annual volume average to be very high, while the median is about 5 times lower. This impact can also be observed in the delivery costs.```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}#### Cases - All Customers >=400#### Cases - All Customers >=400# Filter rows before grouping, to only consider deliveries with total >= 400# full_data_customer %>%# filter(QTD_DLV_CA_2023 + QTD_DLV_CA_2024 >= 400) %>%# group_by(COLD_DRINK_CHANNEL) %>%# summarise(# Total_Cases = sum(QTD_DLV_CA_2023 + QTD_DLV_CA_2024),# COST_CA = sum(COST_CA_23) + sum(COST_CA_24),# # Count only customers where Total_Cases > 0# N_Customers = n_distinct(CUSTOMER_NUMBER[QTD_DLV_CA_2023 + QTD_DLV_CA_2024 > 0]), # # Calculate total cases per customer, excluding customers with zero cases# Total_Cases_Per_Customer = list(QTD_DLV_CA_2023 + QTD_DLV_CA_2024),# Total_Cost_Per_Customer = list(COST_CA_23 + COST_CA_24),# .groups = 'drop'# ) %>%# mutate(# # Calculate average cost per case# AVG_Cost_CA = COST_CA / Total_Cases, # # Calculate percentage of total cases# PERCT_CASE = round(Total_Cases / sum(Total_Cases) * 100, 1),# # Calculate percentage of total customers# Perct_Customers = round(N_Customers / sum(N_Customers) * 100, 1),# # Calculate average cases per customer (without decimals)# AVG_Qtd = round(Total_Cases / N_Customers), # No decimals for AVG_Qtd# # Calculate median cases per customer, excluding customers with zero cases# Median_Qtd = sapply(Total_Cases_Per_Customer, function(x) {# median(x[x > 0], na.rm = TRUE) # Consider only positive cases for the median# }),# # Calculate median cost per case for each cold drink channel, excluding customers with zero cases# Median_Cost = sapply(1:length(Total_Cases_Per_Customer), function(i) {# total_cost <- Total_Cost_Per_Customer[[i]]# total_cases <- Total_Cases_Per_Customer[[i]]# median(total_cost[total_cases > 0] / 
total_cases[total_cases > 0], na.rm = TRUE) # Median cost per case# })# ) %>%# # Sort by total cases in descending order# arrange(desc(Total_Cases)) %>%# # Calculate the optimal cost for each cold drink channel based on the lowest average delivery cost# left_join(# cost_data %>%# filter(grepl("CASES", as.character(`RANGE_LEVEL`))) %>%# group_by(COLD_DRINK_CHANNEL) %>%# summarise(Opt_Cost = round(min(`Median Delivery Cost`), 2)) %>%# ungroup(), # Ensure only one row per cold drink channel# by = "COLD_DRINK_CHANNEL"# ) %>%# # Format value columns# mutate(# COST_CA = scales::comma(COST_CA), # Total_Cases = scales::comma(Total_Cases), # AVG_Cost_CA = scales::comma(AVG_Cost_CA, accuracy = 0.01),# N_Customers = scales::comma(N_Customers), # Format N_Customers# PERCT_CASE = sprintf("%.1f", PERCT_CASE), # Ensure 1 decimal place for percentage# Perct_Customers = sprintf("%.1f", Perct_Customers), # Ensure 1 decimal place for percentage# AVG_Qtd = scales::comma(AVG_Qtd), # Format AVG_Qtd# Median_Qtd = scales::comma(Median_Qtd), # Format Median_Qtd# Median_Cost = scales::comma(Median_Cost, accuracy = 0.01), # Format Median_Cost# Opt_Cost = scales::comma(Opt_Cost, accuracy = 0.01) # Format Opt_Cost# ) %>%# # Select the required columns in the correct order with exact column names# dplyr::select(# COLD_DRINK_CHANNEL, Total_Cases, PERCT_CASE, COST_CA, N_Customers, Perct_Customers, AVG_Qtd, Median_Qtd, AVG_Cost_CA, Median_Cost, Opt_Cost# ) %>%# # Rename columns to match the desired format# rename(# `Channel` = COLD_DRINK_CHANNEL,# `T.Cases` = Total_Cases,# `Cases %` = PERCT_CASE,# `T.Cost $` = COST_CA,# `N.Cust` = N_Customers,# `P.Cust %` = Perct_Customers,# `Avg.Qtd.Cust` = AVG_Qtd,# `Median.Qtd.Cust` = Median_Qtd,# `Avg.Cost.Cust $` = AVG_Cost_CA,# `Med.Cost.Cust $` = Median_Cost,# `Opt.Cost $` = Opt_Cost# ) %>%# kable("html", escape = FALSE, align = "c", col.names = c("Channel", "T.Cases", "Cases %", "T.Cost $", "N.Cust", "P.Cust %", "Avg.Qtd.Cust", "Median.Qtd.Cust", 
"Avg.Cost.Cust $", "Med.Cost.Cust $", "Opt.Cost $")) %>%# kable_styling(full_width = FALSE, position = "center") %>%# column_spec(1, bold = TRUE) %>%# column_spec(2:11, width = "6em") %>%# row_spec(0, bold = TRUE, color = "white", background = "darkblue") %>%# add_header_above(c("CASES (23 & 24) - Deliveries by Cold Drink Channel - All Customers >=400 CASE" = 11)) %>%# kable_paper("striped", full_width = FALSE)############# Calculate Quartiles, Customer Count, and Volume Distribution# full_data_customer %>%# # Filter rows before grouping to consider only deliveries with total >= 400# filter(QTD_DLV_CA_2023 + QTD_DLV_CA_2024 >= 400) %>%# group_by(COLD_DRINK_CHANNEL) %>%# summarise(# # Store the total cases per customer, excluding zero values# Cases_Per_Customer = list(QTD_DLV_CA_2023 + QTD_DLV_CA_2024),# .groups = 'drop'# ) %>%# mutate(# # Calculate average and median cases per customer# `Avg.Qtd.Cust` = sapply(Cases_Per_Customer, function(x) mean(x[x > 0])),# `Median.Qtd.Cust` = sapply(Cases_Per_Customer, function(x) median(x[x > 0])),# # Compute quartiles for quantity# `1Quart.Qtd` = sapply(Cases_Per_Customer, function(x) quantile(x[x > 0], 0.25)),# `2Quart.Qtd` = sapply(Cases_Per_Customer, function(x) quantile(x[x > 0], 0.50)), # Median (Q2)# `3Quart.Qtd` = sapply(Cases_Per_Customer, function(x) quantile(x[x > 0], 0.75))# ) %>%# rowwise() %>% # Ensure calculations are row-wise based on quartile values# mutate(# # Extract case values from the list# Case_Values = list(unlist(Cases_Per_Customer)),# # Calculate total cases volume per quartile using the correct conditions# `1Quart.Vol` = sum(Case_Values[which(Case_Values > 0 & Case_Values <= `1Quart.Qtd`)]),# `2Quart.Vol` = sum(Case_Values[which(Case_Values > `1Quart.Qtd` & Case_Values <= `2Quart.Qtd`)]),# `3Quart.Vol` = sum(Case_Values[which(Case_Values > `2Quart.Qtd` & Case_Values <= `3Quart.Qtd`)]),# `4Quart.Vol` = sum(Case_Values[which(Case_Values > `3Quart.Qtd`)]),# # Calculate the total volume for the quartiles 
(1 to 4) in each channel# Total_Vol = `1Quart.Vol` + `2Quart.Vol` + `3Quart.Vol` + `4Quart.Vol`,# # Calculate percentages based on the sum of volumes from all quartiles for each channel# `1Q.Vol%` = round((`1Quart.Vol` / Total_Vol) * 100, 1),# `2Q.Vol%` = round((`2Quart.Vol` / Total_Vol) * 100, 1),# `3Q.Vol%` = round((`3Quart.Vol` / Total_Vol) * 100, 1),# `4Q.Vol%` = round((`4Quart.Vol` / Total_Vol) * 100, 1)# ) %>%# ungroup() %>% # Remove row-wise grouping# # Order by Avg.Qtd.Cust in descending order# arrange(desc(`Avg.Qtd.Cust`)) %>%# # Format numbers for readability# mutate(# `1Quart.Vol` = scales::comma(`1Quart.Vol`),# `2Quart.Vol` = scales::comma(`2Quart.Vol`),# `3Quart.Vol` = scales::comma(`3Quart.Vol`),# `4Quart.Vol` = scales::comma(`4Quart.Vol`),# `1Q.Vol%` = sprintf("%.1f", `1Q.Vol%`),# `2Q.Vol%` = sprintf("%.1f", `2Q.Vol%`),# `3Q.Vol%` = sprintf("%.1f", `3Q.Vol%`),# `4Q.Vol%` = sprintf("%.1f", `4Q.Vol%`)# ) %>%# # Select columns for output# select(# COLD_DRINK_CHANNEL, `Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`,# `1Quart.Vol`, `2Quart.Vol`, `3Quart.Vol`, `4Quart.Vol`, `1Q.Vol%`, `2Q.Vol%`, `3Q.Vol%`, `4Q.Vol%`# ) %>%# rename(# `Channel` = COLD_DRINK_CHANNEL,# `Avg.Qtd.Cust` = `Avg.Qtd.Cust`,# `Median.Qtd.Cust` = `Median.Qtd.Cust`,# `1st.Quart.Qtd` = `1Quart.Qtd`,# `2nd.Quart.Qtd` = `2Quart.Qtd`,# `3rd.Quart.Qtd` = `3Quart.Qtd`,# `1st.Quart.Vol` = `1Quart.Vol`,# `2nd.Quart.Vol` = `2Quart.Vol`,# `3rd.Quart.Vol` = `3Quart.Vol`,# `4th.Quart.Vol` = `4Quart.Vol`,# `1st.Quart.Vol%` = `1Q.Vol%`,# `2nd.Quart.Vol%` = `2Q.Vol%`,# `3rd.Quart.Vol%` = `3Q.Vol%`,# `4th.Quart.Vol%` = `4Q.Vol%`# ) %>%# kable("html", escape = FALSE, align = "c", col.names = c("Channel", "Avg.Qtd.Cust", "Median.Qtd.Cust", "1st.Quart.Qtd", "2nd.Quart.Qtd", "3rd.Quart.Qtd", "1st.Quart.Vol", "2nd.Quart.Vol", "3rd.Quart.Vol", "4th.Quart.Vol", "1st.Quart.Vol%", "2nd.Quart.Vol%", "3rd.Quart.Vol%", "4th.Quart.Vol%")) %>%# kable_styling(full_width = FALSE, position 
= "center") %>%# column_spec(1, bold = TRUE) %>%# column_spec(2:14, width = "6em") %>%# row_spec(0, bold = TRUE, color = "white", background = "darkblue") %>%# add_header_above(c("CASES (23 & 24) - Deliveries by Cold Drink Channel - Quartile Analysis" = 14)) %>%# kable_paper("striped", full_width = FALSE)```#### 4.5.2 Cold Drink Channel - Delivered Gallons for All Customers```{r}# Summarize data by COLD_DRINK_CHANNEL, summing the quantities of gallons (QTD_DLV_GAL_2023 and QTD_DLV_GAL_2024)data_summary_gallons <- full_data_customer %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(Total_Gallons =sum(QTD_DLV_GAL_2023, na.rm =TRUE) +sum(QTD_DLV_GAL_2024, na.rm =TRUE),.groups ='drop' ) %>%mutate(Percentage =round(Total_Gallons /sum(Total_Gallons) *100, 1)) # Calculate the percentage# Create a bar chart for the percentage of total gallons by cold drink channelggplot(data_summary_gallons, aes(x = Total_Gallons, y =reorder(COLD_DRINK_CHANNEL, Total_Gallons), fill = COLD_DRINK_CHANNEL)) +geom_bar(stat ="identity", position ="stack", alpha =0.5) +geom_text(aes(label =paste(Percentage, "%")), position =position_stack(vjust =0.5), hjust =-0.01, color ="black", size =3.2) +labs(title ="All Customers - Percentage of Gallons (23 & 24) by Cold Drink Channel",x ="Percentage of Total Gallons", y =NULL) +scale_x_continuous(labels = scales::percent_format(scale =1), expand =expansion(c(0, 0.05))) +scale_fill_manual(values = cold_drink_channel_colors) +# Apply the custom color palettetheme_minimal() +theme(plot.title =element_text(size =10, face ="bold")) +theme(axis.text.y =element_text(size =10), axis.title.x =element_blank(), # Remove the x-axis titleaxis.text.x =element_blank(), # Remove the x-axis textlegend.position ="none", # Remove the legendpanel.grid.major =element_blank(), panel.grid.minor =element_blank())# Summarize data by COLD_DRINK_CHANNEL, summing the delivery cost for gallons (COST_GAL_23 and COST_GAL_24)data_summary_gallons_cost <- full_data_customer 
%>%group_by(COLD_DRINK_CHANNEL) %>%summarise(Total_Gallons_Cost =sum(COST_GAL_23, na.rm =TRUE) +sum(COST_GAL_24, na.rm =TRUE),.groups ='drop' ) %>%mutate(Percentage =round(Total_Gallons_Cost /sum(Total_Gallons_Cost) *100, 1)) # Calculate the percentage# Create a bar chart for the percentage of total gallons cost by cold drink channelggplot(data_summary_gallons_cost, aes(x = Total_Gallons_Cost, y =reorder(COLD_DRINK_CHANNEL, Total_Gallons_Cost), fill = COLD_DRINK_CHANNEL)) +geom_bar(stat ="identity", position ="stack", alpha =0.5) +geom_text(aes(label =paste(Percentage, "%")), position =position_stack(vjust =0.5), hjust =-0.01, color ="black", size =3.2) +labs(title ="All Customers - Percentage of Gallons Delivery Cost (23 & 24) by Cold Drink Channel",x ="Percentage of Total Gallons Cost", y =NULL) +scale_x_continuous(labels = scales::percent_format(scale =1), expand =expansion(c(0, 0.05))) +scale_fill_manual(values = cold_drink_channel_colors) +# Apply the custom color palettetheme_minimal() +theme(plot.title =element_text(size =10, face ="bold")) +theme(axis.text.y =element_text(size =10), axis.title.x =element_blank(), # Remove the x-axis titleaxis.text.x =element_blank(), # Remove the x-axis textlegend.position ="none", # Remove the legendpanel.grid.major =element_blank(), panel.grid.minor =element_blank())```For gallons, the dining segment is the most representative, accounting for 61% of the volume delivered in 2023 and 2024, and 73% of the cost of gallons. 
The second-largest segment is events, with 18.7% of the volume (10% of the cost), followed by bulk trade with 6.5% (3% of the cost). The tables below provide detailed information within this group on the number of customers, costs, quartile splits, and other relevant factors.

```{r}
# Calculate Total Gallons, COST_GAL, N.Customers, Perct.Customers, AVG_Qtd, and Median.Qtd, then order and format the table
full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons = sum(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    COST_GAL = sum(COST_GAL_23) + sum(COST_GAL_24),
    # Count only customers where Total_Gallons > 0
    N_Customers = n_distinct(CUSTOMER_NUMBER[QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024 > 0]),
    # Keep per-customer gallons and costs as list-columns (zero totals are excluded later)
    Total_Gallons_Per_Customer = list(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    Total_Cost_Per_Customer = list(COST_GAL_23 + COST_GAL_24),
    .groups = 'drop'
  ) %>%
  mutate(
    # Average cost per gallon
    AVG_Cost_GAL = COST_GAL / Total_Gallons,
    # Percentage of total gallons and of total customers
    PERCT_GAL = round(Total_Gallons / sum(Total_Gallons) * 100, 1),
    Perct_Customers = round(N_Customers / sum(N_Customers) * 100, 1),
    # Average gallons per customer (no decimals)
    AVG_Qtd = round(Total_Gallons / N_Customers),
    # Median gallons per customer, excluding customers with zero gallons
    Median_Qtd = sapply(Total_Gallons_Per_Customer, function(x) {
      median(x[x > 0], na.rm = TRUE)
    }),
    # Median cost per gallon per channel, excluding customers with zero gallons
    Median_Cost = sapply(seq_along(Total_Gallons_Per_Customer), function(i) {
      total_cost <- Total_Cost_Per_Customer[[i]]
      total_gallons <- Total_Gallons_Per_Customer[[i]]
      median(total_cost[total_gallons > 0] / total_gallons[total_gallons > 0], na.rm = TRUE)
    })
  ) %>%
  # Order by Total Gallons in descending order (before formatting)
  arrange(desc(Total_Gallons)) %>%
  # Opt_Cost per COLD_DRINK_CHANNEL: minimum median delivery cost for GALLONS
  left_join(
    cost_data %>%
      filter(grepl("GALLONS", as.character(`RANGE_LEVEL`))) %>%
      group_by(COLD_DRINK_CHANNEL) %>%
      summarise(Opt_Cost = round(min(`Median Delivery Cost`), 2)) %>%
      ungroup(),  # Ensures one row per COLD_DRINK_CHANNEL
    by = "COLD_DRINK_CHANNEL"
  ) %>%
  # Format the value columns after ordering
  mutate(
    COST_GAL = scales::comma(COST_GAL),
    Total_Gallons = scales::comma(Total_Gallons),
    AVG_Cost_GAL = scales::comma(AVG_Cost_GAL, accuracy = 0.01),
    N_Customers = scales::comma(N_Customers),
    PERCT_GAL = sprintf("%.1f", PERCT_GAL),          # 1 decimal place for percentage
    Perct_Customers = sprintf("%.1f", Perct_Customers),
    AVG_Qtd = scales::comma(AVG_Qtd),
    Median_Qtd = scales::comma(Median_Qtd),
    Median_Cost = scales::comma(Median_Cost, accuracy = 0.01),
    Opt_Cost = scales::comma(Opt_Cost, accuracy = 0.01)
  ) %>%
  # Select columns in the desired order
  dplyr::select(
    COLD_DRINK_CHANNEL, Total_Gallons, PERCT_GAL, COST_GAL, N_Customers, Perct_Customers,
    AVG_Qtd, Median_Qtd, AVG_Cost_GAL, Median_Cost, Opt_Cost
  ) %>%
  # Rename columns to match the desired output
  rename(
    `Channel` = COLD_DRINK_CHANNEL,
    `T.Gallons` = Total_Gallons,
    `Gallons %` = PERCT_GAL,
    `T.Cost $` = COST_GAL,
    `N.Cust` = N_Customers,
    `P.Cust %` = Perct_Customers,
    `Avg.Qtd.Cust` = AVG_Qtd,
    `Median.Qtd.Cust` = Median_Qtd,
    `Avg.Cost.Cust $` = AVG_Cost_GAL,
    `Med.Cost.Cust $` = Median_Cost,
    `Opt.Cost $` = Opt_Cost
  ) %>%
  kable("html", escape = FALSE, align = "c",
        col.names = c("Channel", "T.Gallons", "Gallons %", "T.Cost $", "N.Cust", "P.Cust %",
                      "Avg.Qtd.Cust", "Median.Qtd.Cust", "Avg.Cost.Cust $",
"Med.Cost.Cust $", "Opt.Cost $")) %>%kable_styling(full_width =FALSE, position ="center") %>%column_spec(1, bold =TRUE) %>%column_spec(2:11, width ="6em") %>%row_spec(0, bold =TRUE, color ="black", background ="#FFCCCB") %>%add_header_above(c("GALLONS (23 & 24) - Deliveries by Cold Drink Channel - All Customers"=11)) %>%kable_paper("striped", full_width =FALSE)############# Calculate Quartiles, Customer Count, and Volume Distributionfull_data_customer %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(# Store the total gallons per customer, excluding zero valuesGallons_Per_Customer =list(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),.groups ='drop' ) %>%mutate(# Calculate the average and median gallons per customer`Avg.Qtd.Cust`=sapply(Gallons_Per_Customer, function(x) mean(x[x >0])),`Median.Qtd.Cust`=sapply(Gallons_Per_Customer, function(x) median(x[x >0])),# Compute quartiles for quantity`1Quart.Qtd`=sapply(Gallons_Per_Customer, function(x) quantile(x[x >0], 0.25)),`2Quart.Qtd`=sapply(Gallons_Per_Customer, function(x) quantile(x[x >0], 0.50)), # Median (Q2)`3Quart.Qtd`=sapply(Gallons_Per_Customer, function(x) quantile(x[x >0], 0.75)) ) %>%rowwise() %>%# Ensure calculations are row-wise based on quartile valuesmutate(# Extract gallon values from the listGallon_Values =list(unlist(Gallons_Per_Customer)),# Calculate total gallons volume per quartile using the correct conditions`1Quart.Vol`=sum(Gallon_Values[which(Gallon_Values >0& Gallon_Values <=`1Quart.Qtd`)]),`2Quart.Vol`=sum(Gallon_Values[which(Gallon_Values >`1Quart.Qtd`& Gallon_Values <=`2Quart.Qtd`)]),`3Quart.Vol`=sum(Gallon_Values[which(Gallon_Values >`2Quart.Qtd`& Gallon_Values <=`3Quart.Qtd`)]),`4Quart.Vol`=sum(Gallon_Values[which(Gallon_Values >`3Quart.Qtd`)]),# Calculate the total volume for the quartiles (1 to 4) in each channelTotal_Vol =`1Quart.Vol`+`2Quart.Vol`+`3Quart.Vol`+`4Quart.Vol`,# Calculate percentages based on the sum of volumes from all quartiles for each channel`1Q.Vol%`=round((`1Quart.Vol`/ Total_Vol) 
*100, 1),`2Q.Vol%`=round((`2Quart.Vol`/ Total_Vol) *100, 1),`3Q.Vol%`=round((`3Quart.Vol`/ Total_Vol) *100, 1),`4Q.Vol%`=round((`4Quart.Vol`/ Total_Vol) *100, 1) ) %>%ungroup() %>%# Remove row-wise grouping# Order by Avg.Qtd.Cust in descending orderarrange(desc(`Avg.Qtd.Cust`)) %>%# Format numbers for readabilitymutate(`Avg.Qtd.Cust`= scales::comma(`Avg.Qtd.Cust`, accuracy =1),`Median.Qtd.Cust`= scales::comma(`Median.Qtd.Cust`, accuracy =1),`1Quart.Qtd`= scales::comma(`1Quart.Qtd`, accuracy =1),`2Quart.Qtd`= scales::comma(`2Quart.Qtd`, accuracy =1),`3Quart.Qtd`= scales::comma(`3Quart.Qtd`, accuracy =1),`1Quart.Vol`= scales::comma(`1Quart.Vol`, accuracy =1),`2Quart.Vol`= scales::comma(`2Quart.Vol`, accuracy =1),`3Quart.Vol`= scales::comma(`3Quart.Vol`, accuracy =1),`4Quart.Vol`= scales::comma(`4Quart.Vol`, accuracy =1),`1Q.Vol%`=formatC(`1Q.Vol%`, format ="f", digits =1),`2Q.Vol%`=formatC(`2Q.Vol%`, format ="f", digits =1),`3Q.Vol%`=formatC(`3Q.Vol%`, format ="f", digits =1),`4Q.Vol%`=formatC(`4Q.Vol%`, format ="f", digits =1) ) %>%# Select only required columns dplyr::select( COLD_DRINK_CHANNEL, `Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`, `1Quart.Vol`, `1Q.Vol%`, `2Quart.Vol`, `2Q.Vol%`, `3Quart.Vol`, `3Q.Vol%`, `4Quart.Vol`, `4Q.Vol%` ) %>%# Rename columnsrename(`Channel`= COLD_DRINK_CHANNEL ) %>%kable("html", escape =FALSE, align ="c", col.names =c("Channel", "Avg.Qtd.Cust", "Median.Qtd.Cust", "1Quart.Qtd", "2Quart.Qtd", "3Quart.Qtd", "1Quart.Vol", "1Q.Vol%", "2Quart.Vol", "2Q.Vol%", "3Quart.Vol", "3Q.Vol%", "4Quart.Vol", "4Q.Vol%")) %>%kable_styling(full_width =FALSE, position ="center") %>%column_spec(1, bold =TRUE) %>%column_spec(2:14, width ="6em") %>%row_spec(0, bold =TRUE, color ="black", background ="#FFCCCB") %>%add_header_above(c("GALLONS (23 & 24) - Quartile Analysis by Cold Drink Channel - All Customers"=14)) %>%kable_paper("striped", full_width =FALSE)```The tables above can be used for different analyses, which will 
not be discussed here. It is worth highlighting that the dining segment has an average consumption of 522 gallons per customer and a median of 235, so the gap between average and median cost is smaller than the one cases produce for the bulk trade sector.

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
#### Gallons - All Customers >= 400
# Commented-out variant kept for reference but not run: it repeats the active
# gallons summary table above with an added
# filter(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024 >= 400) step and the header
# "GALLONS (23 & 24) - Deliveries by Cold Drink Channel - All Customers >=400 GAL".

############
# Commented-out variant kept for reference but not run: the same quartile,
# customer-count, and volume-distribution analysis restricted to customers with
# >= 400 gallons, styled with the "Quartiles and Volumes per Cold Drink Channel" header.
```

```{r, warning=FALSE}
# Mean and median for the "DINING" channel, computed without creating a permanent column
mean_value <- mean(
  (full_data_customer$QTD_DLV_GAL_2023 + full_data_customer$QTD_DLV_GAL_2024)[
    full_data_customer$COLD_DRINK_CHANNEL == "DINING" &
      (full_data_customer$QTD_DLV_GAL_2023 + full_data_customer$QTD_DLV_GAL_2024) > 0
  ],
  na.rm = TRUE
)

median_value <- median(
  (full_data_customer$QTD_DLV_GAL_2023 + full_data_customer$QTD_DLV_GAL_2024)[
    full_data_customer$COLD_DRINK_CHANNEL == "DINING" &
      (full_data_customer$QTD_DLV_GAL_2023 + full_data_customer$QTD_DLV_GAL_2024) > 0
  ],
  na.rm = TRUE
)

# Filter to the "DINING" channel, exclude zero totals, and plot the histogram
full_data_customer %>%
  filter(COLD_DRINK_CHANNEL == "DINING" & (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024) > 0) %>%
  mutate(total_gallons = QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024) %>%  # Temporary 'total_gallons' column
  ggplot(aes(x = total_gallons)) +
  # Dark gray bars with no border
  geom_histogram(binwidth = 0.5, fill = "darkgray", color = "darkgray", alpha = 0.7) +
  # Vertical lines for the mean and median
  geom_vline(aes(xintercept = mean_value, color = "Mean"), linetype = "solid", size = 0.6) +
  geom_vline(aes(xintercept = median_value, color = "Median"), linetype = "solid", size = 0.6) +
  # Customize colors and legend labels
  scale_color_manual(
    values = c("Mean" = "blue", "Median" = "coral"),
    labels = c(paste("Mean:", round(mean_value, 0)),
               paste("Median:", round(median_value, 0)))
  ) +
  labs(title = "Total
Gallons Delivered for Dining Channel",
       subtitle = "(Limited to a Maximum of 5000)",
       x = "Total Gallons Delivered",
       y = "Number of Customers") +
  xlim(0, 5000) +  # Limit the x-axis to 5000
  theme_minimal() +
  theme(panel.grid.major.y = element_line(color = "gray", size = 0.5),  # Horizontal grid lines
        panel.grid = element_blank(),  # Remove the remaining grid lines
        axis.text.x = element_text(angle = 45, hjust = 1),
        plot.title = element_text(size = 14, face = "bold"),
        plot.subtitle = element_text(size = 12, face = "italic"),
        legend.position = "right",       # Move the legend to the right
        legend.title = element_blank(),  # Remove the legend title
        legend.key = element_blank())    # Remove the legend key background
```

#### 4.5.3 Cold Drink Channel - Delivered Gallons for Local Market Partners Fountain Only

```{r}
# Summarize data by COLD_DRINK_CHANNEL, summing the delivered gallons, filtered to LOCAL_FOUNT_ONLY == 1
data_summary_gallons <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons = sum(QTD_DLV_GAL_2023, na.rm = TRUE) + sum(QTD_DLV_GAL_2024, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  mutate(Percentage = round(Total_Gallons / sum(Total_Gallons) * 100, 1))  # Calculate the percentage

# Create a bar chart for the percentage of total gallons by cold drink channel
ggplot(data_summary_gallons, aes(x = Total_Gallons, y = reorder(COLD_DRINK_CHANNEL, Total_Gallons),
                                 fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.5) +
  geom_text(aes(label = paste(Percentage, "%")), position = position_stack(vjust = 0.5),
            hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Local Fountain Only - Percentage of Gallons (23 & 24) by Cold Drink Channel",
       x = "Percentage of Total Gallons", y = NULL) +
  scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold")) +
  theme(axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),  # Remove the x-axis title
        axis.text.x = element_blank(),   # Remove the x-axis text
        legend.position = "none",        # Remove the legend
        panel.grid.major = element_blank(), panel.grid.minor = element_blank())

# # Summarize data by COLD_DRINK_CHANNEL, summing the gallon delivery cost (COST_GAL_23 and COST_GAL_24),
# # filtered to LOCAL_FOUNT_ONLY == 1
# data_summary_gallons_cost <- full_data_customer %>%
#   filter(LOCAL_FOUNT_ONLY == 1) %>%
#   group_by(COLD_DRINK_CHANNEL) %>%
#   summarise(
#     Total_Gallons_Cost = sum(COST_GAL_23, na.rm = TRUE) + sum(COST_GAL_24, na.rm = TRUE),
#     .groups = 'drop'
#   ) %>%
#   mutate(Percentage = round(Total_Gallons_Cost / sum(Total_Gallons_Cost) * 100, 1))

# # Create a bar chart for the percentage of total gallons cost by cold drink channel for LOCAL_FOUNT_ONLY
# ggplot(data_summary_gallons_cost, aes(x = Total_Gallons_Cost,
#                                       y = reorder(COLD_DRINK_CHANNEL, Total_Gallons_Cost),
#                                       fill = COLD_DRINK_CHANNEL)) +
#   geom_bar(stat = "identity", position = "stack", alpha = 0.5) +
#   geom_text(aes(label = paste(Percentage, "%")), position = position_stack(vjust = 0.5),
#             hjust = -0.01, color = "black", size = 3.2) +
#   labs(title = "Local Fountain Only - Percentage of Gallons Delivery Cost (23 & 24) by Cold Drink Channel",
#        x = "Percentage of Total Gallons Cost",
#        y = NULL) +
#   scale_x_continuous(labels = scales::percent_format(scale = 1), expand = expansion(c(0, 0.05))) +
#   scale_fill_manual(values = cold_drink_channel_colors) +  # Apply the custom color palette
#   theme_minimal() +
#   theme(plot.title = element_text(size = 10, face = "bold")) +
#   theme(axis.text.y = element_text(size = 10),
#         axis.title.x = element_blank(),  # Remove the x-axis title
#         axis.text.x = element_blank(),   # Remove the x-axis text
#         legend.position = "none",        # Remove the legend
#         panel.grid.major = element_blank(),
#
#         panel.grid.minor = element_blank())
```

Among the local fountain-only customers, nearly 90% of the demand comes from the dining segment, followed by events at 4.5% and workplace at 3.5%. Costs follow nearly the same proportions and are therefore not displayed. The tables below provide detailed information within this group on the number of customers, costs, quartile splits, and other relevant factors.

```{r}
# Calculate Total Gallons, COST_GAL, N.Customers, Perct.Customers, AVG_Qtd, and Median.Qtd, then order and format the table
full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(
    Total_Gallons = sum(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    COST_GAL = sum(COST_GAL_23) + sum(COST_GAL_24),
    # Count only customers where Total_Gallons > 0
    N_Customers = n_distinct(CUSTOMER_NUMBER[QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024 > 0]),
    # Keep per-customer gallons and costs as list-columns (zero totals are excluded later)
    Total_Gallons_Per_Customer = list(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    Total_Cost_Per_Customer = list(COST_GAL_23 + COST_GAL_24),
    .groups = 'drop'
  ) %>%
  mutate(
    # Average cost per gallon
    AVG_Cost_GAL = COST_GAL / Total_Gallons,
    # Percentage of total gallons and of total customers
    PERCT_GAL = round(Total_Gallons / sum(Total_Gallons) * 100, 1),
    Perct_Customers = round(N_Customers / sum(N_Customers) * 100, 1),
    # Average gallons per customer (no decimals)
    AVG_Qtd = round(Total_Gallons / N_Customers),
    # Median gallons per customer, excluding customers with zero gallons
    Median_Qtd = sapply(Total_Gallons_Per_Customer, function(x) {
      median(x[x > 0], na.rm = TRUE)
    }),
    # Median cost per gallon per channel, excluding customers with zero gallons
    Median_Cost = sapply(seq_along(Total_Gallons_Per_Customer), function(i) {
      total_cost <- Total_Cost_Per_Customer[[i]]
      total_gallons <- Total_Gallons_Per_Customer[[i]]
      median(total_cost[total_gallons > 0] / total_gallons[total_gallons > 0], na.rm = TRUE)
    })
  ) %>%
  # Order by Total Gallons in descending order (before formatting)
  arrange(desc(Total_Gallons)) %>%
  # Opt_Cost per COLD_DRINK_CHANNEL: minimum median delivery cost for GALLONS
  left_join(
    cost_data %>%
      filter(grepl("GALLONS", as.character(`RANGE_LEVEL`))) %>%
      group_by(COLD_DRINK_CHANNEL) %>%
      summarise(Opt_Cost = round(min(`Median Delivery Cost`), 2)) %>%
      ungroup(),  # Ensures one row per COLD_DRINK_CHANNEL
    by = "COLD_DRINK_CHANNEL"
  ) %>%
  # Format the value columns after ordering
  mutate(
    COST_GAL = scales::comma(COST_GAL),
    Total_Gallons = scales::comma(Total_Gallons),
    AVG_Cost_GAL = scales::comma(AVG_Cost_GAL, accuracy = 0.01),
    N_Customers = scales::comma(N_Customers),
    PERCT_GAL = sprintf("%.1f", PERCT_GAL),
    Perct_Customers = sprintf("%.1f", Perct_Customers),
    AVG_Qtd = scales::comma(AVG_Qtd),
    Median_Qtd = scales::comma(Median_Qtd),
    Median_Cost = scales::comma(Median_Cost, accuracy = 0.01),
    Opt_Cost = scales::comma(Opt_Cost, accuracy = 0.01)
  ) %>%
  # Select columns in the desired order
  dplyr::select(
    COLD_DRINK_CHANNEL, Total_Gallons, PERCT_GAL, COST_GAL, N_Customers, Perct_Customers,
    AVG_Qtd, Median_Qtd, AVG_Cost_GAL, Median_Cost, Opt_Cost
  ) %>%
  # Rename columns to match the desired output
  rename(
    `Channel` = COLD_DRINK_CHANNEL,
    `T.Gallons` = Total_Gallons,
    `Gallons %` = PERCT_GAL,
    `T.Cost $` = COST_GAL,
    `N.Cust` = N_Customers,
    `P.Cust %` = Perct_Customers,
    `Avg.Qtd.Cust` = AVG_Qtd,
    `Median.Qtd.Cust` = Median_Qtd,
    `Avg.Cost.Cust $` = AVG_Cost_GAL,
    `Med.Cost.Cust $` = Median_Cost,
    `Opt.Cost $` = Opt_Cost
  )
%>%kable("html", escape =FALSE, align ="c", col.names =c("Channel", "T.Gallons", "Gallons %", "T.Cost $", "N.Cust", "P.Cust %", "Avg.Qtd.Cust", "Median.Qtd.Cust", "Avg.Cost.Cust $", "Med.Cost.Cust $", "Opt.Cost $")) %>%kable_styling(full_width =FALSE, position ="center") %>%column_spec(1, bold =TRUE) %>%column_spec(2:11, width ="6em") %>%row_spec(0, bold =TRUE, color ="black", background ="darkorange") %>%add_header_above(c("GALLONS (23 & 24) - Deliveries by Cold Drink Channel - Local Fountain Only"=11)) %>%kable_paper("striped", full_width =FALSE)######## Calculate Quartiles, Customer Count, and Volume Distributionfull_data_customer %>%filter(LOCAL_FOUNT_ONLY ==1) %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(# Store the total gallons per customer, excluding zero valuesGallons_Per_Customer =list(QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),.groups ='drop' ) %>%mutate(# Calculate the average and median gallons per customer`Avg.Qtd.Cust`=sapply(Gallons_Per_Customer, function(x) mean(x[x >0])),`Median.Qtd.Cust`=sapply(Gallons_Per_Customer, function(x) median(x[x >0])),# Compute quartiles for quantity`1Quart.Qtd`=sapply(Gallons_Per_Customer, function(x) quantile(x[x >0], 0.25)),`2Quart.Qtd`=sapply(Gallons_Per_Customer, function(x) quantile(x[x >0], 0.50)), # Median (Q2)`3Quart.Qtd`=sapply(Gallons_Per_Customer, function(x) quantile(x[x >0], 0.75)) ) %>%rowwise() %>%# Ensure calculations are row-wise based on quartile valuesmutate(# Extract gallon values from the listGallon_Values =list(unlist(Gallons_Per_Customer)),# Calculate total gallons volume per quartile using the correct conditions`1Quart.Vol`=sum(Gallon_Values[which(Gallon_Values >0& Gallon_Values <=`1Quart.Qtd`)]),`2Quart.Vol`=sum(Gallon_Values[which(Gallon_Values >`1Quart.Qtd`& Gallon_Values <=`2Quart.Qtd`)]),`3Quart.Vol`=sum(Gallon_Values[which(Gallon_Values >`2Quart.Qtd`& Gallon_Values <=`3Quart.Qtd`)]),`4Quart.Vol`=sum(Gallon_Values[which(Gallon_Values >`3Quart.Qtd`)]),# Calculate the total volume for the quartiles 
(1 to 4) in each channelTotal_Vol =`1Quart.Vol`+`2Quart.Vol`+`3Quart.Vol`+`4Quart.Vol`,# Calculate percentages based on the sum of volumes from all quartiles for each channel`1Q.Vol%`=round((`1Quart.Vol`/ Total_Vol) *100, 1),`2Q.Vol%`=round((`2Quart.Vol`/ Total_Vol) *100, 1),`3Q.Vol%`=round((`3Quart.Vol`/ Total_Vol) *100, 1),`4Q.Vol%`=round((`4Quart.Vol`/ Total_Vol) *100, 1) ) %>%ungroup() %>%# Remove row-wise grouping# Order by Avg.Qtd.Cust in descending orderarrange(desc(`Avg.Qtd.Cust`)) %>%# Format numbers for readabilitymutate(`Avg.Qtd.Cust`= scales::comma(`Avg.Qtd.Cust`, accuracy =1),`Median.Qtd.Cust`= scales::comma(`Median.Qtd.Cust`, accuracy =1),`1Quart.Qtd`= scales::comma(`1Quart.Qtd`, accuracy =1),`2Quart.Qtd`= scales::comma(`2Quart.Qtd`, accuracy =1),`3Quart.Qtd`= scales::comma(`3Quart.Qtd`, accuracy =1),`1Quart.Vol`= scales::comma(`1Quart.Vol`, accuracy =1),`2Quart.Vol`= scales::comma(`2Quart.Vol`, accuracy =1),`3Quart.Vol`= scales::comma(`3Quart.Vol`, accuracy =1),`4Quart.Vol`= scales::comma(`4Quart.Vol`, accuracy =1),`1Q.Vol%`=formatC(`1Q.Vol%`, format ="f", digits =1),`2Q.Vol%`=formatC(`2Q.Vol%`, format ="f", digits =1),`3Q.Vol%`=formatC(`3Q.Vol%`, format ="f", digits =1),`4Q.Vol%`=formatC(`4Q.Vol%`, format ="f", digits =1) ) %>%# Select only required columns dplyr::select( COLD_DRINK_CHANNEL, `Avg.Qtd.Cust`, `Median.Qtd.Cust`, `1Quart.Qtd`, `2Quart.Qtd`, `3Quart.Qtd`, `1Quart.Vol`, `1Q.Vol%`, `2Quart.Vol`, `2Q.Vol%`, `3Quart.Vol`, `3Q.Vol%`, `4Quart.Vol`, `4Q.Vol%` ) %>%# Rename columnsrename(`Channel`= COLD_DRINK_CHANNEL ) %>%kable("html", escape =FALSE, align ="c", col.names =c("Channel", "Avg.Qtd.Cust", "Median.Qtd.Cust", "1Quart.Qtd", "2Quart.Qtd", "3Quart.Qtd", "1Quart.Vol", "1Q.Vol%", "2Quart.Vol", "2Q.Vol%", "3Quart.Vol", "3Q.Vol%", "4Quart.Vol", "4Q.Vol%")) %>%kable_styling(full_width =FALSE, position ="center") %>%column_spec(1, bold =TRUE) %>%column_spec(2:14, width ="6em") %>%row_spec(0, bold =TRUE, color ="black", background 
="darkorange") %>%add_header_above(c("GALLONS (23 & 24) - Quartile Analysis by Cold Drink Channel - Local Fountain Only"=14)) %>%kable_paper("striped", full_width =FALSE)```The tables above can be used for different analyses, which will not be discussed here. It is worth highlighting that among the local market partners (fountain only), the average consumption was 444 gallons and the median was 177, resulting in an average cost of $2.14 per gallon, which is nearly half of the cost per gallon for customers, which is $3.98.### 4.6 Trade Channel```{r, warning=FALSE}# Summarize data by TRADE_CHANNEL, summing the quantities of gallons and casesdata_summary_trade_channel <- full_data_customer %>%group_by(TRADE_CHANNEL) %>%summarise(Total_Volume =sum(QTD_DLV_GAL_2023, na.rm =TRUE) +sum(QTD_DLV_GAL_2024, na.rm =TRUE) +sum(QTD_DLV_CA_2023, na.rm =TRUE) +sum(QTD_DLV_CA_2024, na.rm =TRUE),.groups ='drop' ) %>%mutate(Percentage =round(Total_Volume /sum(Total_Volume) *100, 1))# Define a dynamic color palette to handle more than 9 categoriesnum_colors <-length(unique(data_summary_trade_channel$TRADE_CHANNEL))custom_palette <-setNames(colorRampPalette(brewer.pal(9, "Set2"))(num_colors),unique(data_summary_trade_channel$TRADE_CHANNEL))# Create a horizontal bar chart for the percentage of total volume by trade channelggplot(data_summary_trade_channel, aes(x = Total_Volume /1e6, y =reorder(TRADE_CHANNEL, Total_Volume), fill = TRADE_CHANNEL)) +geom_bar(stat ="identity", position ="stack", alpha =0.7) +geom_text(aes(label =paste(Percentage, "%")), position =position_stack(vjust =0.5),hjust =-0.01, color ="black", size =3.2) +labs(title ="Percentage of Total Volume (Gallons and Cases) by Trade Channel",x ="Quantity in Millions",y =NULL) +scale_x_continuous(limits =c(0, 7.5),breaks =c(2.5, 5, 7.5),labels =function(x) paste0(x, "M"),expand =expansion(c(0, 0.05)) ) +geom_vline(xintercept =c(2.5, 5, 7.5), color ="lightgray", linetype ="solid", linewidth =0.3) +scale_fill_manual(values = 
custom_palette) +theme_minimal() +theme(plot.title =element_text(size =10, face ="bold"),axis.text.y =element_text(size =10),axis.text.x =element_text(size =10),axis.title.x =element_text(size =10, face ="bold"),legend.position ="none",panel.grid.major =element_blank(),panel.grid.minor =element_blank() )```Among the trade channels, Fast Casual Dining (19%), Comprehensive Dining (13.4%), and Travel (12%) rank among the top five in terms of total volume demand. These are also the only segments that individually represent more than 10% of the total volume.### 4.7 Sub Trade ChannelThe sub trade channel consists of 48 classes, so we decided to create a table for reference and queries.```{r}# Create a summary table for the frequency of each unique value in SUB_TRADE_CHANNELdata_summary_sub_trade_channel <- profile_data %>%group_by(SUB_TRADE_CHANNEL) %>%summarise(Count =n()) %>%mutate(Percentage =round(Count /sum(Count) *100, 1)) # Display the interactive table with DTdatatable(data_summary_sub_trade_channel, options =list(pageLength =5, autoWidth =TRUE, dom ='Bfrtip', buttons =c('copy', 'csv', 'excel', 'pdf')))```### 4.8 CO2 Customers```{r}# Calculate percentages for CO2_CUSTOMERco2_customer_summary <- profile_data %>%group_by(CO2_CUSTOMER) %>%summarise(Count =n()) %>%mutate(Percentage =round(Count /sum(Count) *100, 1))# Create the plotggplot(co2_customer_summary, aes(x = CO2_CUSTOMER, y = Percentage, fill =as.factor(CO2_CUSTOMER))) +geom_bar(stat ="identity", position ="dodge", alpha =0.6) +geom_text(aes(label =paste0(Percentage, "%")), position =position_dodge(width =0.8), vjust =0.2, size =3.5) +labs(title ="Percentage Breakdown by CO2 Customers Status") +scale_fill_manual(values =c("0"="#8ED081", "1"="#A7ADC6"), labels =c("Non-CO2 Customers", "CO2 Customers")) +scale_y_continuous(labels =percent_format(scale =1)) +theme_minimal() +theme(plot.title =element_text(size =10, face ="bold"),axis.text.y =element_blank(),axis.title.x =element_blank(),axis.title.y 
=element_blank(),legend.title =element_blank(),legend.position ="right", # Position the legend to the rightpanel.grid.major =element_blank(),panel.grid.minor =element_blank(),axis.text.x =element_text(size =10),axis.ticks.y =element_blank() ) +scale_x_discrete(labels =c("0"="Non-CO2 Customers", "1"="CO2 Customers"))```Around 61% of customers do not consume CO2, including all local market partners. However, we still find that the percentage of customers consuming CO2 is relatively high, at nearly 39%.### 4.9 Transactions by Cases```{r}# Create the summary table with adjusted minimum value and median considering values > 0summary_cases <-data.frame(type =c("ORDERED_CASES", "LOADED_CASES", "DELIVERED_CASES", "RETURNED_CASES"),# Calculating the minimum value considering values > 0 and rounding to four decimal placesmin =c(round(min(op_data$ORDERED_CASES[op_data$ORDERED_CASES >0]), 4),round(min(op_data$LOADED_CASES[op_data$LOADED_CASES >0]), 4),round(min(op_data$DELIVERED_CASES[op_data$DELIVERED_CASES >0]), 4),round(min(op_data$RETURNED_CASES[op_data$RETURNED_CASES >0]), 4) ),# Median calculation considering only values greater than 0median =c(median(op_data$ORDERED_CASES[op_data$ORDERED_CASES >0]),median(op_data$LOADED_CASES[op_data$LOADED_CASES >0]),median(op_data$DELIVERED_CASES[op_data$DELIVERED_CASES >0]),median(op_data$RETURNED_CASES[op_data$RETURNED_CASES >0]) ),# Maximum without decimal placesmax =c(floor(max(op_data$ORDERED_CASES)),floor(max(op_data$LOADED_CASES)),floor(max(op_data$DELIVERED_CASES)),floor(max(op_data$RETURNED_CASES)) ),# Sum with thousands separatorsum_qtd =c(format(sum(op_data$ORDERED_CASES), big.mark =","),format(sum(op_data$LOADED_CASES), big.mark =","),format(sum(op_data$DELIVERED_CASES), big.mark =","),format(sum(op_data$RETURNED_CASES), big.mark =",") ),# Number of transactions with thousands separatornum_trans =c(format(sum(op_data$ORDERED_CASES >0), big.mark =","),format(sum(op_data$LOADED_CASES >0), big.mark 
=","),format(sum(op_data$DELIVERED_CASES >0), big.mark =","),format(sum(op_data$RETURNED_CASES >0), big.mark =",") ),# Average quantity per transaction without decimalsavg_qtd_by_trans =c(round(sum(op_data$ORDERED_CASES) /max(1, sum(op_data$ORDERED_CASES >0))),round(sum(op_data$LOADED_CASES) /max(1, sum(op_data$LOADED_CASES >0))),round(sum(op_data$DELIVERED_CASES) /max(1, sum(op_data$DELIVERED_CASES >0))),round(sum(op_data$RETURNED_CASES) /max(1, sum(op_data$RETURNED_CASES >0))) ))# Create the table using kableExtra for better formattingsummary_cases %>%kable("html", escape =FALSE, align ="c") %>%kable_styling(full_width = F, position ="center") %>%column_spec(1, bold =TRUE) %>%column_spec(2:7, width ="6em") %>%row_spec(0, bold =TRUE, color ="black", background ="#ADD8E6") %>%# Light blue headeradd_header_above(c("CASES - Statistics by transactions greater than 0"=7)) %>%kable_paper("striped", full_width = F)```Considering all case transactions, we created the table above to generate some key metrics. The values for ORDERED CASES, LOADED CASES, and DELIVERED CASES are similar, as expected. There are records with quantities less than 1 unit, and the maximum values exceed 8,000 cases, with the average per transaction being approximately 35 cases.The number of transactions for RETURNED CASES is much smaller, but there was a return of 3,132 cases. 
The average number of cases per transaction is 60.```{r, message=FALSE, warning=FALSE, fig.width=6, fig.height=3}# Transforming the data to long formatop_data_long <- op_data %>% dplyr::select(ORDERED_CASES, LOADED_CASES, DELIVERED_CASES) %>%pivot_longer(cols =everything(), names_to ="case_type", values_to ="count") %>%mutate(case_type =factor(case_type, levels =c("ORDERED_CASES", "LOADED_CASES", "DELIVERED_CASES")))# Define border colors based on case_typeborder_colors <-c("ORDERED_CASES"="grey", "LOADED_CASES"="lightblue", "DELIVERED_CASES"="darkblue")# Plot with histogramsggplot(op_data_long, aes(x = count)) +geom_histogram(binwidth =1, aes(fill = case_type, color = case_type), alpha =0.7) +facet_wrap(~case_type, scales ="free_x", nrow =1, labeller =as_labeller(c("ORDERED_CASES"="Ordered", "LOADED_CASES"="Loaded", "DELIVERED_CASES"="Delivered"))) +scale_y_continuous(trans ='log10', breaks = scales::trans_breaks("log10", function(x) 10^x), labels = scales::trans_format("log10", scales::math_format(10^.x))) +scale_x_continuous(limits =c(0, 5000)) +# Limit x-axis to 5000scale_color_manual(values = border_colors) +theme_minimal() +labs(title ="Histograms of Case Counts", x ="Case Count", y ="Frequency (Log Scale)") +theme(strip.background =element_blank(), strip.text =element_text(color ="black", size =9), panel.grid.major.x =element_blank(), panel.grid.minor =element_blank(), panel.grid.major.y =element_line(color ="grey", size =0.5), axis.title =element_text(size =7),axis.text =element_text(size =6),plot.title =element_text(size =10, face ="bold", hjust =0.5), strip.text.x =element_text(size =8, hjust =0.5), legend.position ="none", axis.text.y =element_text(size =7), axis.title.y =element_text(size =8), panel.spacing =unit(1, "lines") )```Above, we have the histogram of transactions related to case counts. We have limited the visualization to 5000 cases and applied a logarithmic scale for better interpretation. 
It is noticeable that the number of transactions decreases near 1900 cases and then increases again around 2000. This could potentially correlate with the larger clients.Below is the histogram of returned cases, where it is evident that the number of transactions is relatively low, with quantities generally not exceeding 250 cases. There are some transactions exceeding 1,000 cases, but they are rare. These were excluded to make the chart more interpretable.```{r, message=FALSE, warning=FALSE, fig.width=3, fig.height=3}# Transforming the data to long format for RETURNED_CASESop_data_long_returned <- op_data %>% dplyr::select(RETURNED_CASES) %>%pivot_longer(cols =everything(), names_to ="case_type", values_to ="count") %>%mutate(case_type =factor(case_type, levels =c("RETURNED_CASES")))# Define border colors for RETURNED_CASESborder_colors_returned <-c("RETURNED_CASES"="black")# Plot with histogram for RETURNED_CASESggplot(op_data_long_returned, aes(x = count)) +geom_histogram(binwidth =1, aes(fill = case_type, color = case_type), alpha =0.7) +scale_x_continuous(limits =c(0, 1000)) +# Set max limit for x-axisscale_y_continuous(trans ='log10', breaks = scales::trans_breaks("log10", function(x) 10^x), labels = scales::trans_format("log10", scales::math_format(10^.x))) +scale_color_manual(values = border_colors_returned) +theme_minimal() +labs(title ="Returned Case Counts", x ="Case Count", y ="Frequency (Log Scale)") +theme(strip.background =element_blank(), strip.text =element_text(color ="black", size =9), panel.grid.major.x =element_blank(), panel.grid.minor =element_blank(), panel.grid.major.y =element_line(color ="grey", size =0.5), axis.title =element_text(size =7),axis.text =element_text(size =6),plot.title =element_text(size =10, face ="bold", hjust =0.5), strip.text.x =element_text(size =8, hjust =0.5), legend.position ="none", axis.text.y =element_text(size =7), axis.title.y =element_text(size =8), panel.spacing =unit(1, "lines") )```### 4.10 Transactions by 
Gallons```{r}# Create the summary table with adjusted minimum value and median considering values > 0 for GALLONSsummary_gallons <-data.frame(type =c("ORDERED_GALLONS", "LOADED_GALLONS", "DELIVERED_GALLONS", "RETURNED_GALLONS"),# Calculating the minimum value considering values > 0 and rounding to four decimal placesmin =c(round(min(op_data$ORDERED_GALLONS[op_data$ORDERED_GALLONS >0]), 4),round(min(op_data$LOADED_GALLONS[op_data$LOADED_GALLONS >0]), 4),round(min(op_data$DELIVERED_GALLONS[op_data$DELIVERED_GALLONS >0]), 4),round(min(op_data$RETURNED_GALLONS[op_data$RETURNED_GALLONS >0]), 4) ),# Median calculation considering only values greater than 0median =c(median(op_data$ORDERED_GALLONS[op_data$ORDERED_GALLONS >0]),median(op_data$LOADED_GALLONS[op_data$LOADED_GALLONS >0]),median(op_data$DELIVERED_GALLONS[op_data$DELIVERED_GALLONS >0]),median(op_data$RETURNED_GALLONS[op_data$RETURNED_GALLONS >0]) ),# Maximum without decimal placesmax =c(floor(max(op_data$ORDERED_GALLONS)),floor(max(op_data$LOADED_GALLONS)),floor(max(op_data$DELIVERED_GALLONS)),floor(max(op_data$RETURNED_GALLONS)) ),# Sum with no decimal placessum_qtd =c(format(floor(sum(op_data$ORDERED_GALLONS)), big.mark =","),format(floor(sum(op_data$LOADED_GALLONS)), big.mark =","),format(floor(sum(op_data$DELIVERED_GALLONS)), big.mark =","),format(floor(sum(op_data$RETURNED_GALLONS)), big.mark =",") ),# Number of transactions with thousands separatornum_trans =c(format(sum(op_data$ORDERED_GALLONS >0), big.mark =","),format(sum(op_data$LOADED_GALLONS >0), big.mark =","),format(sum(op_data$DELIVERED_GALLONS >0), big.mark =","),format(sum(op_data$RETURNED_GALLONS >0), big.mark =",") ),# Average quantity per transaction without decimalsavg_qtd_by_trans =c(round(sum(op_data$ORDERED_GALLONS) /max(1, sum(op_data$ORDERED_GALLONS >0))),round(sum(op_data$LOADED_GALLONS) /max(1, sum(op_data$LOADED_GALLONS >0))),round(sum(op_data$DELIVERED_GALLONS) /max(1, sum(op_data$DELIVERED_GALLONS 
> 0))),
    round(sum(op_data$RETURNED_GALLONS) / max(1, sum(op_data$RETURNED_GALLONS > 0)))
  )
)

# Create the table using kableExtra for better formatting
summary_gallons %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:7, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#FFCCCB") %>% # Light red header
  add_header_above(c("GALLONS - Statistics by transactions greater than 0" = 7)) %>%
  kable_paper("striped", full_width = FALSE)
```

The values for ORDERED GALLONS, LOADED GALLONS, and DELIVERED GALLONS are similar, as expected. There are records with quantities below 1 unit, the maximum values exceed 2,200 gallons, and the average per transaction is approximately 21 gallons. The number of gallon transactions is significantly lower, at roughly 60% of the case-transaction count.

The number of transactions for RETURNED GALLONS is much smaller, but a total of 1,792 gallons was returned. The average number of gallons per transaction is 18.

```{r, message=FALSE, warning=FALSE, fig.width=6, fig.height=3}
# Transforming the data to long format for gallons
op_data_long_gallons <- op_data %>%
  dplyr::select(ORDERED_GALLONS, LOADED_GALLONS, DELIVERED_GALLONS) %>%
  pivot_longer(cols = everything(), names_to = "gallon_type", values_to = "count") %>%
  mutate(gallon_type = factor(gallon_type,
                              levels = c("ORDERED_GALLONS", "LOADED_GALLONS", "DELIVERED_GALLONS")))

# Define border colors based on gallon_type
border_colors_gallons <- c("ORDERED_GALLONS" = "grey", "LOADED_GALLONS" = "coral",
                           "DELIVERED_GALLONS" = "darkred")

# Plot with histograms for gallons
ggplot(op_data_long_gallons, aes(x = count)) +
  geom_histogram(binwidth = 1, aes(fill = gallon_type, color = gallon_type), alpha = 0.7) +
  facet_wrap(~gallon_type, scales = "fixed", nrow = 1,
             labeller = as_labeller(c("ORDERED_GALLONS" = "Ordered", "LOADED_GALLONS" = "Loaded",
                                      "DELIVERED_GALLONS" = "Delivered"))) +
  scale_y_continuous(trans = 'log10', breaks =
scales::trans_breaks("log10", function(x) 10^x), labels = scales::trans_format("log10", scales::math_format(10^.x))) +scale_color_manual(values = border_colors_gallons) +scale_x_continuous(limits =c(0, 1000)) +# Limit the x-axis to 1000theme_minimal() +labs(title ="Histograms of Gallon Counts", x ="Gallon Count", y ="Frequency (Log Scale)") +theme(strip.background =element_blank(), strip.text =element_text(color ="black", size =9), panel.grid.major.x =element_blank(), panel.grid.minor =element_blank(), panel.grid.major.y =element_line(color ="grey", size =0.5), axis.title =element_text(size =7),axis.text =element_text(size =6),plot.title =element_text(size =10, face ="bold", hjust =0.5), strip.text.x =element_text(size =8, hjust =0.5), legend.position ="none", axis.text.y =element_text(size =7), axis.title.y =element_text(size =8), panel.spacing =unit(1, "lines") )```We limited the histograms of gallon counts per transaction to 1000 for better visualization. There are only a few operations that exceed this limit. 
The vast majority of transactions do not exceed 500 gallons.```{r, message=FALSE, warning=FALSE, fig.width=3, fig.height=3}# Transforming the data to long format for RETURNED_GALLONSop_data_long_returned_gallons <- op_data %>% dplyr::select(RETURNED_GALLONS) %>%pivot_longer(cols =everything(), names_to ="gallon_type", values_to ="count") %>%mutate(gallon_type =factor(gallon_type, levels =c("RETURNED_GALLONS")))# Define border colors for RETURNED_GALLONSborder_colors_returned_gallons <-c("RETURNED_GALLONS"="black")# Plot with histogram for RETURNED_GALLONSggplot(op_data_long_returned_gallons, aes(x = count)) +geom_histogram(binwidth =1, aes(fill = gallon_type, color = gallon_type), alpha =0.7) +scale_y_continuous(trans ='log10', breaks = scales::trans_breaks("log10", function(x) 10^x), labels = scales::trans_format("log10", scales::math_format(10^.x))) +scale_color_manual(values = border_colors_returned_gallons) +scale_x_continuous(limits =c(0, 500)) +# Limit the x-axis to 500theme_minimal() +labs(title ="Returned Gallon Counts", x ="Gallon Count", y ="Frequency (Log Scale)") +theme(strip.background =element_blank(), strip.text =element_text(color ="black", size =9), panel.grid.major.x =element_blank(), panel.grid.minor =element_blank(), panel.grid.major.y =element_line(color ="grey", size =0.5), axis.title =element_text(size =7),axis.text =element_text(size =6),plot.title =element_text(size =10, face ="bold", hjust =0.5), strip.text.x =element_text(size =8, hjust =0.5), legend.position ="none", axis.text.y =element_text(size =7), axis.title.y =element_text(size =8), panel.spacing =unit(1, "lines") )```The number of returned gallon transactions is much lower compared to cases. 
Overall, these transactions do not exceed 100 gallons.

### 4.11 Transaction Dates Overview

```{r}
# Aggregate the transactions by month/year for gallons and cases delivered
op_data_monthly_delivery <- op_data %>%
  mutate(Month_Year = floor_date(TRANSACTION_DATE, "month")) %>%
  group_by(Month_Year) %>%
  summarise(Total_Delivered_Cases = sum(DELIVERED_CASES, na.rm = TRUE),
            Total_Delivered_Gallons = sum(DELIVERED_GALLONS, na.rm = TRUE))

# Reshape the data to long format for facet_wrap
op_data_long_delivery <- op_data_monthly_delivery %>%
  pivot_longer(cols = starts_with("Total_Delivered"), names_to = "Event", values_to = "Value")

# Create the plot with the same Y-axis scale for both events
ggplot(op_data_long_delivery, aes(x = Month_Year, y = Value, fill = Event)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Event, scales = "fixed", ncol = 1) + # Shared x-axis and same scale for both
  labs(title = "Monthly Delivered Cases and Gallons JAN 2023 - DEC 2024",
       x = "Month",
       y = "Total Units") +
  theme_minimal() +
  theme(legend.position = "none", # Remove the legend
        axis.text.x = element_text(size = 9), # Adjust x-axis label size for readability
        panel.grid.major.x = element_blank(), # Remove vertical grid lines
        panel.grid.minor.x = element_blank()) + # Remove minor vertical grid lines
  scale_x_date(labels = scales::date_format("%b"), breaks = scales::date_breaks("1 month")) + # Month abbreviations (JAN, FEB, etc.)
  scale_y_continuous(labels = scales::comma) + # Display Y-axis in full units (e.g., 1000 instead of 1K)
  scale_fill_manual(values = c("Total_Delivered_Gallons" = "#FFCCCB",
                               "Total_Delivered_Cases" = "#ADD8E6")) # Set custom colors
```

The seasonal effect, related to lower temperatures (OCT-MAR), is more pronounced for the number of delivered cases than for gallons.
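To put a number on the seasonality described above, one option is a month-of-year share index: each calendar month's fraction of the two-year total. The sketch below uses synthetic monthly totals with illustrative names (`toy`, `seasonal_idx`), not the project's actual figures:

```r
library(dplyr)
library(lubridate)

# Synthetic stand-in for op_data_monthly_delivery (24 monthly totals)
toy <- data.frame(
  Month_Year = seq(as.Date("2023-01-01"), as.Date("2024-12-01"), by = "month"),
  Total_Delivered_Cases = rep(c(80, 70, 90, 100, 120, 140, 150, 145, 120, 100, 85, 75), 2)
)

# Share of each calendar month in the total across both years
seasonal_idx <- toy %>%
  mutate(mon = month(Month_Year, label = TRUE)) %>%
  group_by(mon) %>%
  summarise(share = sum(Total_Delivered_Cases) / sum(toy$Total_Delivered_Cases),
            .groups = "drop")
```

Months with shares well above 1/12 mark the peak season; computing the same index for gallons would quantify how much stronger the effect is for cases.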
Additionally, this chart highlights the significant difference in consumption between the two, as both quantities are represented on the same scale.```{r}# Aggregate the transactions by month/year for gallons and cases deliveredop_data_monthly_delivery <- op_data %>%mutate(Month_Year =floor_date(TRANSACTION_DATE, "month")) %>%group_by(Month_Year) %>%summarise(Total_Delivered_Cases =sum(DELIVERED_CASES, na.rm =TRUE),Total_Delivered_Gallons =sum(DELIVERED_GALLONS, na.rm =TRUE))# Calculate the percentage of gallons in total (gallons + cases)op_data_monthly_delivery <- op_data_monthly_delivery %>%mutate(Total_Sales = Total_Delivered_Cases + Total_Delivered_Gallons,Percentage_Gallons = (Total_Delivered_Gallons / Total_Sales) *100)# Create the plot with the percentage of gallons soldggplot(op_data_monthly_delivery, aes(x = Month_Year, y = Percentage_Gallons)) +geom_bar(stat ="identity", fill ="#FFCCCB") +# Gallons colorlabs(title ="Percentage of Gallons Sold Relative to Total Sales (23 & 24)",x ="Month",y ="Percentage of Gallons (%)") +theme_minimal() +theme(axis.text.x =element_text(size =9, angle =0, hjust =1), # Rotate x-axis labels for better readabilitypanel.grid.major.x =element_blank(), # Remove vertical grid linespanel.grid.minor.x =element_blank()) +# Remove minor vertical grid linesscale_x_date(labels = scales::date_format("%b"), breaks = scales::date_breaks("1 month")) +# Format x-axis to show month abbreviationsscale_y_continuous(labels = scales::percent_format(scale =1), breaks =seq(0, 100, by =5)) # Set y-axis breaks to show percentages in 5% increments```The sale of gallons over the months remains between 20% and 25% of the total volume.### 4.12 Retailer Consumption Quantities```{r}# Count distinct Retailerscat("Number of Retailers:", n_distinct(full_data$PRIMARY_GROUP_NUMBER), "\n")# Count distinct storescat("Number of Outlets/Stores:", n_distinct(full_data$CUSTOMER_NUMBER), "\n")```Of the 30,320 stores, many belong to the same chains, with 1,020 networks 
represented in the dataset. (PRIMARY_GROUP_NUMBER = 0 represents the single stores.)

```{r}
# Creates the total deliveries by customer type (Single Store or Retailer Group),
# splitting both cases and gallons by year so the per-year columns are consistent
total_delivered <- full_data %>%
  mutate(customer_type = ifelse(PRIMARY_GROUP_NUMBER == 0, "Single Store", "Retailer Group")) %>%
  group_by(customer_type) %>%
  summarise(
    qtd_cases_dlv_23 = sum(ifelse(YEAR == 2023, DELIVERED_CASES, 0), na.rm = TRUE),
    qtd_cases_dlv_24 = sum(ifelse(YEAR == 2024, DELIVERED_CASES, 0), na.rm = TRUE),
    qtd_gallons_dlv_23 = sum(ifelse(YEAR == 2023, DELIVERED_GALLONS, 0), na.rm = TRUE),
    qtd_gallons_dlv_24 = sum(ifelse(YEAR == 2024, DELIVERED_GALLONS, 0), na.rm = TRUE),
    total_qtd_cases_dlv = sum(DELIVERED_CASES, na.rm = TRUE),
    total_qtd_gallons_dlv = sum(DELIVERED_GALLONS, na.rm = TRUE)
  ) %>%
  ungroup()

# Calculates global totals for delivered cases and delivered gallons
total_cases <- sum(full_data$DELIVERED_CASES, na.rm = TRUE)
total_gallons <- sum(full_data$DELIVERED_GALLONS, na.rm = TRUE)

# Calculates the percentage for each group
total_delivered <- total_delivered %>%
  mutate(
    perc_total_qtd_cases = (total_qtd_cases_dlv / total_cases) * 100,
    perc_total_gallons = (total_qtd_gallons_dlv / total_gallons) * 100
  )

# Converts to data.table for efficient processing
setDT(total_delivered)

# Rounds percentages
total_delivered[, perc_total_qtd_cases := round(perc_total_qtd_cases, 0)]
total_delivered[, perc_total_gallons := round(perc_total_gallons, 0)]

# Adds a 'Total' row with global totals
total_delivered_total <- total_delivered %>%
  summarise(
    customer_type = "Total",
    qtd_cases_dlv_23 = sum(qtd_cases_dlv_23),
    qtd_cases_dlv_24 = sum(qtd_cases_dlv_24),
    qtd_gallons_dlv_23 = sum(qtd_gallons_dlv_23),
    qtd_gallons_dlv_24 = sum(qtd_gallons_dlv_24),
    total_qtd_cases_dlv = sum(total_qtd_cases_dlv),
    total_qtd_gallons_dlv = sum(total_qtd_gallons_dlv),
    perc_total_qtd_cases = 100,
    perc_total_gallons = 100
  ) %>%
  as.data.table()

# Combines the 'Total' row with the previous data
total_delivered <- rbind(total_delivered, total_delivered_total)

# Creates the cases table with the relevant columns
cases_table <- total_delivered[, .(
  customer_type, qtd_cases_dlv_23, qtd_cases_dlv_24,
  total_qtd_cases_dlv, perc_total_qtd_cases)]

# Creates the gallons table with the per-year gallon columns
gallons_table <- total_delivered[, .(
  customer_type, qtd_gallons_dlv_23, qtd_gallons_dlv_24,
  total_qtd_gallons_dlv, perc_total_gallons)]

# Creates the total table (cases + gallons) with year-consistent sums
total_table <- total_delivered[, .(
  customer_type,
  qtd_cas_gal_23 = qtd_cases_dlv_23 + qtd_gallons_dlv_23,
  qtd_cas_gal_24 = qtd_cases_dlv_24 + qtd_gallons_dlv_24,
  total_qtd_cas_gal = total_qtd_cases_dlv + total_qtd_gallons_dlv,
  perc_total_qtd = ((total_qtd_cases_dlv + total_qtd_gallons_dlv) / (total_cases + total_gallons)) * 100)]

# Rounds the percentage for the total table
total_table[, perc_total_qtd := round(perc_total_qtd, 0)]

# Format the numeric columns with a thousand separator for all tables
format_cols_cases <- c("qtd_cases_dlv_23", "qtd_cases_dlv_24", "total_qtd_cases_dlv", "perc_total_qtd_cases")
format_cols_gallons <- c("qtd_gallons_dlv_23", "qtd_gallons_dlv_24", "total_qtd_gallons_dlv", "perc_total_gallons")
format_cols_total <- c("qtd_cas_gal_23", "qtd_cas_gal_24", "total_qtd_cas_gal", "perc_total_qtd")

# Format the columns after the tables are created
cases_table[, (format_cols_cases) := lapply(.SD, function(x) format(x, big.mark = ",", scientific = FALSE)), .SDcols = format_cols_cases]
gallons_table[, (format_cols_gallons) := lapply(.SD, function(x) format(x, big.mark = ",", scientific = FALSE)), .SDcols = format_cols_gallons]
total_table[, (format_cols_total) := lapply(.SD, function(x) format(x, big.mark = ",", scientific = FALSE)), .SDcols = format_cols_total]

# Displays cases
cases_table %>%
  kable("html", escape = FALSE, align = "c") %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:5, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#ADD8E6") %>% # Light blue header
  add_header_above(c("CASES - Statistics by deliveries greater than 0" = 5))
%>%kable_paper("striped", full_width = F)```Considering cases, 80% of the volume went to stores that belong to larger groups.```{r}# Displays gallonsgallons_table %>%kable("html", escape =FALSE, align ="c") %>%kable_styling(full_width = F, position ="center") %>%column_spec(1, bold =TRUE) %>%column_spec(2:5, width ="6em") %>%row_spec(0, bold =TRUE, color ="black", background ="#FFCCCB") %>%# Light red headeradd_header_above(c("GALLONS - Statistics by deliveries greater than 0"=5)) %>%kable_paper("striped", full_width = F)```As for gallons, the distribution is similar, with 53% going to single stores and 47% to retailer groups, indicating that local stores have a greater share in gallon consumption compared to cases.```{r}# Displays total (cases + gallons)total_table %>%kable("html", escape =FALSE, align ="c") %>%kable_styling(full_width = F, position ="center") %>%column_spec(1, bold =TRUE) %>%column_spec(2:5, width ="6em") %>%row_spec(0, bold =TRUE, color ="black", background ="lightgray") %>%# Light blue headeradd_header_above(c("TOTAL - Combined Deliveries Quantities for Cases and Gallons"=5)) %>%kable_paper("striped", full_width = F)```The table below helps to better explore the data presented above.```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}# Summarizing CO2_CUSTOMER status and LOCAL_MARKET_PARTNER statusstatus_summary <- full_data %>%group_by(PRIMARY_GROUP_NUMBER) %>%summarise(co2_status =case_when(all(CO2_CUSTOMER ==0) ~"Only 0",all(CO2_CUSTOMER ==1) ~"Only 1",TRUE~"Both 0 and 1" ),lmp_status =case_when(all(LOCAL_MARKET_PARTNER ==0) ~"Only 0",all(LOCAL_MARKET_PARTNER ==1) ~"Only 1",TRUE~"Both 0 and 1" ) ) %>%filter(co2_status =="Only 0"& lmp_status =="Only 1")# View the filtered resultstatus_summary``````{r}# Summarize delivered cases and gallons for 2023 and 2024summary_2023 <- full_data %>%filter(YEAR ==2023) %>%group_by(PRIMARY_GROUP_NUMBER) %>%summarise(cas_qtd_dlv23 =sum(DELIVERED_CASES, na.rm =TRUE),gal_qtd_dlv23 
=sum(DELIVERED_GALLONS, na.rm =TRUE) )summary_2024 <- full_data %>%filter(YEAR ==2024) %>%group_by(PRIMARY_GROUP_NUMBER) %>%summarise(cas_qtd_dlv24 =sum(DELIVERED_CASES, na.rm =TRUE),gal_qtd_dlv24 =sum(DELIVERED_GALLONS, na.rm =TRUE) )# Merge summaries and compute total valuesgroup_demand <-full_join(summary_2023, summary_2024, by ="PRIMARY_GROUP_NUMBER") %>%mutate(across(c(cas_qtd_dlv23, gal_qtd_dlv23, cas_qtd_dlv24, gal_qtd_dlv24), ~replace_na(., 0)),total_23 = cas_qtd_dlv23 + gal_qtd_dlv23,total_24 = cas_qtd_dlv24 + gal_qtd_dlv24,sum_23_24 = total_23 + total_24 ) %>%rename(PGN = PRIMARY_GROUP_NUMBER) %>%arrange(desc(sum_23_24)) # %>%# filter(PGN != 0) # Exclude rows where PRIMARY_GROUP_NUMBER is 0# Convert to data.table for performancesetDT(group_demand)# Display interactive table with formatted numbers (without changing type)datatable( group_demand, options =list(pageLength =10, autoWidth =TRUE),rownames =FALSE,caption ="Quantity Delivered") %>%formatCurrency(columns =c("cas_qtd_dlv23", "gal_qtd_dlv23", "cas_qtd_dlv24", "gal_qtd_dlv24", "total_23", "total_24", "sum_23_24"),currency ="", # No currency symboldigits =0, # No decimal placesmark =","# Thousands separator )# List all variables in the environmentall_vars <-ls()# Exclude 'full_data', 'full_data_customer', and the new variables from removalvars_to_keep <-c("full_data", "full_data_customer", "cost_data", "customer_address", "mydir", "one_seed", "op_data", "profile_data", "reference_date","custom_palette")# Get the variables to removevars_to_remove <-setdiff(all_vars, vars_to_keep)# Remove the temporary data framesrm(list = vars_to_remove)# Clean up by removing 'all_vars' and 'vars_to_remove'rm(all_vars, vars_to_remove)```## 5. Feature EngineeringConsidering all the previous analyses, the goal now is to complement the information that can enhance the robustness of the modeling process. 
Several feature engineering techniques were attempted, but only the most relevant ones will be described.### 5.1 Census DataThe data used for updating the location information comes from the U.S. Census Bureau, specifically the American Community Survey (ACS), which annually adjusts its results based on the most recent data. For 2023, the ACS data was retrieved, which is adjusted using the 2020 Census data. However, data for 2024 was not yet available at the time of retrieval.The decision to use coordinates for store locations, even when there are multiple instances of identical coordinates across different ZIP codes, was made due to the challenges encountered when retrieving Census data based on ZIP codes. Different stores or customers within the same ZIP code can share coordinates, particularly in areas like shopping centers with multiple businesses.Below are the descriptions of the import data:```{r, out.height=2}#Creating the data for the table census_data <-tibble(variable =c("MED_HH_INC", "GINI_IDX", "PER_CAP_INC", "MED_HOME_VAL", "POV_POP", "INC_LVL_1", "INC_LVL_2", "INC_LVL_3", "INC_LVL_4", "INC_LVL_5", "INC_LVL_6", "INC_LVL_7", "INC_LVL_8", "INC_LVL_9", "INC_LVL_10", "INC_LVL_11", "INC_LVL_12", "INC_LVL_13", "INC_LVL_14", "INC_LVL_15", "INC_LVL_16", "TOT_HOUS_UNITS", "VAC_HOUS_UNITS", "MED_GROSS_RENT", "BACH_DEG", "MAST_DEG", "DOC_DEG", "UNEMP_POP", "EMP_POP", "TOT_WORK_POP", "SNAP_HH", "MED_FAM_INC", "TOT_POP", "MALE_POP", "FEMALE_POP", "COMMUTE_POP", "COMMUTE_POP_DRIVE" ),description =c("Median household income", "Gini index of income inequality", "Per capita income", "Median home value", "Population below poverty", "Income less than $10,000", "$10,000 to $14,999", "$15,000 to $19,999", "$20,000 to $24,999", "$25,000 to $29,999", "$30,000 to $34,999", "$35,000 to $39,999", "$40,000 to $44,999", "$45,000 to $49,999", "$50,000 to $59,999", "$60,000 to $74,999", "$75,000 to $99,999", "$100,000 to $124,999", "$125,000 to $149,999", "$150,000 to $199,999", 
"$200,000 or more", "Total housing units", "Vacant housing units", "Median gross rent", "Bachelor's degree holders", "Master's degree holders", "Doctoral degree holders", "Unemployed population", "Employed population", "Total working population", "Food stamp households", "Median family income", "Total population", "Male population", "Female population", "Total commuter population", "Total commuter population driving" ) )#Tabledatatable(census_data, options =list(scrollX =TRUE, pageLength =10), caption ="List of Census Variables and Descriptions")``````{r, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}library(tidycensus)library(sf) # Census Bureau API key#census_api_key(" ", install = TRUE)# Create a copy of full_data_customer with only the relevant columnsdata_sf <- full_data_customer %>% dplyr::select(CUSTOMER_NUMBER, LONGITUDE, LATITUDE)# Convert customer data to sf objectdata_sf <- data_sf %>%st_as_sf(coords =c("LONGITUDE", "LATITUDE"), crs =4326)# Ensure the 'census_variables' object is definedcensus_variables <-tibble(code =c("B19013_001", "B19083_001", "B19301_001", "B25077_001", "B17001_002", "B19001_002", "B19001_003", "B19001_004", "B19001_005", "B19001_006", "B19001_007", "B19001_008", "B19001_009", "B19001_010", "B19001_011", "B19001_012", "B19001_013", "B19001_014", "B19001_015", "B19001_016", "B19001_017", "B25001_001", "B25002_003", "B25064_001", "B15003_017", "B15003_022", "B15003_025", "B23025_005", "B23025_004", "B24011_001", "B22001_002", "B19058_001", "B01003_001", "B01001_002", "B01001_026", "B08006_001", "B08006_002" ),description =c("MED_HH_INC", "GINI_IDX", "PER_CAP_INC", "MED_HOME_VAL", "POV_POP", "INC_LVL_1", "INC_LVL_2", "INC_LVL_3", "INC_LVL_4", "INC_LVL_5", "INC_LVL_6", "INC_LVL_7", "INC_LVL_8", "INC_LVL_9", "INC_LVL_10", "INC_LVL_11", "INC_LVL_12", "INC_LVL_13", "INC_LVL_14", "INC_LVL_15", "INC_LVL_16", "TOT_HOUS_UNITS", "VAC_HOUS_UNITS", "MED_GROSS_RENT", "BACH_DEG", "MAST_DEG", "DOC_DEG", "UNEMP_POP", "EMP_POP", 
"TOT_WORK_POP", "SNAP_HH", "MED_FAM_INC", "TOT_POP", "MALE_POP", "FEMALE_POP", "COMMUTE_POP", "COMMUTE_POP_DRIVE" ),full_description =c("Median household income", "Gini index of income inequality", "Per capita income", "Median home value", "Population below poverty", "Income less than $10,000", "$10,000 to $14,999", "$15,000 to $19,999", "$20,000 to $24,999", "$25,000 to $29,999", "$30,000 to $34,999", "$35,000 to $39,999", "$40,000 to $44,999", "$45,000 to $49,999", "$50,000 to $59,999", "$60,000 to $74,999", "$75,000 to $99,999", "$100,000 to $124,999", "$125,000 to $149,999", "$150,000 to $199,999", "$200,000 or more", "Total housing units", "Vacant housing units", "Median gross rent", "Bachelor's degree holders", "Master's degree holders", "Doctoral degree holders", "Unemployed population", "Employed population", "Total working population", "Food stamp households", "Median family income", "Total population", "Male population", "Female population", "Total commuter population", "Total commuter population driving" ))# Retrieve ACS dataacs_data <-get_acs(geography ="tract",variables = census_variables$code,year =2023,state =unique(full_data_customer$STATE),geometry =TRUE)# Merge with descriptionsacs_data <- acs_data %>%left_join(census_variables, by =c("variable"="code"))# Transform CRS to match customer datadata_sf <-st_transform(data_sf, st_crs(acs_data))# Perform spatial joinjoined_data_sf <-st_join(data_sf, acs_data, join = st_intersects)# Reshape the dataset, keeping only the 'estimate' valuescensus <- joined_data_sf %>%mutate(variable_name =if_else(variable %in% census_variables$code, description, variable) ) %>%pivot_wider(names_from = variable_name,values_from = estimate,names_glue ="{variable_name}" )# Select only the required columnscensus <- census %>% dplyr::select( CUSTOMER_NUMBER, MED_HH_INC, GINI_IDX, PER_CAP_INC, MED_HOME_VAL, POV_POP, INC_LVL_1, INC_LVL_2, INC_LVL_3, INC_LVL_4, INC_LVL_5, INC_LVL_6, INC_LVL_7, INC_LVL_8, INC_LVL_9, INC_LVL_10, 
INC_LVL_11, INC_LVL_12, INC_LVL_13, INC_LVL_14, INC_LVL_15, INC_LVL_16, TOT_HOUS_UNITS, VAC_HOUS_UNITS, MED_GROSS_RENT, BACH_DEG, MAST_DEG, DOC_DEG, UNEMP_POP, EMP_POP, TOT_WORK_POP, SNAP_HH, MED_FAM_INC, TOT_POP, MALE_POP, FEMALE_POP, COMMUTE_POP, COMMUTE_POP_DRIVE )# Remove the geometry column and convert to a normal data framecensus <- census %>%st_drop_geometry() %>%as.data.frame()# Handle missing and infinite values (replace -Inf with NA)census[census ==-Inf] <-NA# Optionally impute missing values or remove themcensus[is.na(census)] <-0# You could also choose to impute using other strategies# Aggregate census data by CUSTOMER_NUMBER, keeping the highest value for each columncensus <- census %>%group_by(CUSTOMER_NUMBER) %>%summarise(across(everything(), max, na.rm =TRUE), .groups ="drop")# Perform the join between full_data_customer and census on the CUSTOMER_NUMBER columnfull_data_customer <- full_data_customer %>% dplyr::left_join(census, by ="CUSTOMER_NUMBER")# Remove any duplicated columns or columns with ".x" suffixesfull_data_customer <- full_data_customer %>% dplyr::select(-ends_with(".x")) %>% dplyr::rename_with(~gsub("\\.y$", "", .), ends_with(".y"))# Transforming variable types before savefull_data_customer$COLD_DRINK_CHANNEL <-as.factor(full_data_customer$COLD_DRINK_CHANNEL)full_data_customer$TRADE_CHANNEL <-as.factor(full_data_customer$TRADE_CHANNEL)full_data_customer$SUB_TRADE_CHANNEL <-as.factor(full_data_customer$SUB_TRADE_CHANNEL)```During the modeling process, it became clear that the absence of 2024 data limited the analysis. In addition, correlations between the census variables and, in particular, customer demand volumes were very low. Because of this, these variables were not explored further in the document. 
The goal is for this initial process to serve as a foundation for future analyses.

### 5.2 RFM Score

The RFM (Recency, Frequency, Monetary) analysis segments customers based on purchasing behavior, providing insights into consumption patterns. Adapting this model to customer orders helps assess both the frequency and the volume of purchases.

#### 5.2.1 Frequency - Days Between Orders

To adapt the RFM analysis to purchase periods and ordered quantities, the analysis will focus on customer orders. Before calculating the number of days between orders (frequency), the total number of orders per customer will be determined, considering only transactions with a quantity of gallons or cases greater than 0.

```{r}
# Filter valid transactions (ORDERED_CASES > 0 or ORDERED_GALLONS > 0)
valid_orders <- full_data %>%
  filter(ORDERED_CASES > 0 | ORDERED_GALLONS > 0)

# Calculate the number of orders > 0 per customer
orders_per_customer <- valid_orders %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(NUM_ORDERS = n(), .groups = "drop") %>%
  ungroup()

# Add the column NUM_ORDERS in full_data_customer
full_data_customer <- full_data_customer %>%
  left_join(orders_per_customer, by = "CUSTOMER_NUMBER")

# Find customers who do not meet the condition (NO valid transactions)
customers_not_meeting_filter <- full_data_customer %>%
  filter(is.na(NUM_ORDERS)) %>%
  summarise(unique_customers = n_distinct(CUSTOMER_NUMBER))

# Print the number of unique customers who don't meet the filter
#print(customers_not_meeting_filter)

# Remove unnecessary intermediate data frames
rm(valid_orders, orders_per_customer, customers_not_meeting_filter)
```

There are 135 customers with no order transactions greater than zero in the dataset; for these customers, I will count their delivery transactions as orders.

```{r}
# Filter customers with NUM_ORDERS == NA
customers_with_na_orders <- full_data_customer %>%
  filter(is.na(NUM_ORDERS)) %>%
  dplyr::select(CUSTOMER_NUMBER) %>%
  distinct()

# Filter valid delivery transactions (DELIVERED_CASES > 0 or DELIVERED_GALLONS > 0) in full_data
valid_deliveries <- full_data %>%
  filter(DELIVERED_CASES > 0 | DELIVERED_GALLONS > 0)

# Calculate the number of valid deliveries per customer with NUM_ORDERS == NA
deliveries_per_customer <- valid_deliveries %>%
  filter(CUSTOMER_NUMBER %in% customers_with_na_orders$CUSTOMER_NUMBER) %>%
  group_by(CUSTOMER_NUMBER) %>%
  summarise(NUM_DELIVERIES = n()) %>%
  ungroup()

# Update NUM_ORDERS only for customers with NUM_ORDERS == NA
full_data_customer <- full_data_customer %>%
  left_join(deliveries_per_customer, by = "CUSTOMER_NUMBER") %>%
  mutate(
    NUM_ORDERS = if_else(is.na(NUM_ORDERS), NUM_DELIVERIES, NUM_ORDERS)
  ) %>%
  dplyr::select(-NUM_DELIVERIES) # Drop the temporary NUM_DELIVERIES column

# Ensure full_data has the NUM_ORDERS column with the same values as full_data_customer
full_data <- full_data %>%
  left_join(full_data_customer %>% dplyr::select(CUSTOMER_NUMBER, NUM_ORDERS),
            by = "CUSTOMER_NUMBER")

# Remove unnecessary intermediate data frames
rm(customers_with_na_orders, valid_deliveries, deliveries_per_customer)
```

Considering all order transactions recorded in 2023 and 2024, each unique customer has between 1 and 392 transactions. To better understand each customer's consumption profile, the plot below shows the number of customers in transaction bins where orders of cases or gallons were greater than 0.
For the 135 unique customers who did not have order transactions but received volume, we considered these operations as orders.```{r}# Count the number of valid transactions per customercustomers_by_bin <- full_data_customer %>%group_by(CUSTOMER_NUMBER) %>%summarise(transaction_count =sum(NUM_ORDERS, na.rm =TRUE), .groups ="drop") %>%mutate(transaction_bin =case_when( transaction_count ==1~"1", transaction_count >=2& transaction_count <=10~"2-10", transaction_count >=11& transaction_count <=20~"11-20", transaction_count >=21& transaction_count <=30~"21-30", transaction_count >=31& transaction_count <=40~"31-40", transaction_count >=41& transaction_count <=50~"41-50", transaction_count >=51& transaction_count <=100~"51-100", transaction_count >=101& transaction_count <=200~"101-200", transaction_count >=201& transaction_count <=300~"201-300", transaction_count >300~">300",TRUE~"Other" )) %>%mutate(transaction_bin =factor(transaction_bin, levels =c("1", "2-10", "11-20", "21-30", "31-40", "41-50", "51-100", "101-200", "201-300", ">300"))) %>%group_by(transaction_bin) %>%summarise(unique_customers =n_distinct(CUSTOMER_NUMBER), .groups ="drop") %>%arrange(transaction_bin)# Create a bar plot resembling a histogram of unique customers per transaction binggplot(customers_by_bin, aes(x = transaction_bin, y = unique_customers, fill = transaction_bin)) +geom_bar(stat ="identity", show.legend =FALSE) +geom_text(aes(label = unique_customers), vjust =-0.3, size =3, color ="black") +# Add customer counts above barsscale_fill_brewer(palette ="Set3") +# Use RColorBrewer's Set3 palettelabs(title ="Number of Unique Customers by Transaction Count (Orders > 0)",x ="Transaction Count Bins",y ="Number of Unique Customers") +theme_minimal() +theme(axis.text.x =element_text(hjust =0.5, vjust =0.5), # Centered x-axis labels without rotationpanel.grid.major.x =element_blank(), # Remove vertical grid linespanel.grid.minor.x =element_blank(), # Remove minor vertical grid linesaxis.text 
=element_text(size =9), # Set the size of axis labelsaxis.title =element_text(size =10) # Set the size of axis titles )# Remove unnecessary intermediate data framesrm(customers_by_bin)```The histogram shows that 1,218 customers have only one order transaction, making it impossible to calculate the days between orders. Additionally, 6,798 customers have between 2 and 10 orders. To ensure more reliable figures, we will consider only customers with at least 11 orders for this indicator. As a result, all customers with fewer transactions will be assigned a value of 731 days between orders, indicating low order frequency over a two-year range.```{r}# Calculate the number of days between orders for customers with NUM_ORDERS >= 11full_data <- full_data %>%arrange(CUSTOMER_NUMBER, TRANSACTION_DATE) %>%# Sort by CUSTOMER_NUMBER and TRANSACTION_DATEgroup_by(CUSTOMER_NUMBER) %>%mutate(DAYS_BETWEEN_ORD =case_when( NUM_ORDERS <=10~731, # Set DAYS_BETWEEN_ORD to 731 for customers with NUM_ORDERS <= 10 NUM_ORDERS >=11& (ORDERED_CASES >0| ORDERED_GALLONS >0) ~as.numeric(difftime(TRANSACTION_DATE, lag(TRANSACTION_DATE), units ="days")), # Calculate days between orders for NUM_ORDERS >= 11 where ORDERED_CASES or ORDERED_GALLONS > 0 NUM_ORDERS >=11&!(ORDERED_CASES >0| ORDERED_GALLONS >0) &# Only apply this when the previous condition fails (DELIVERED_CASES >0| DELIVERED_GALLONS >0) ~as.numeric(difftime(TRANSACTION_DATE, lag(TRANSACTION_DATE), units ="days")), # If no ORDERED_CASES or ORDERED_GALLONS > 0, calculate with DELIVERED_CASES or DELIVERED_GALLONSTRUE~NA_real_# For all other cases )) %>%ungroup()# Calculate the average days between orders per customer and round the result to the nearest integeravg_days_per_customer <- full_data %>%group_by(CUSTOMER_NUMBER) %>%summarise(AVG_DAYS_BET_ORD =round(mean(DAYS_BETWEEN_ORD, na.rm =TRUE), 0)) %>%# Round to nearest integerungroup()# Update full_data_customer with the average days between ordersfull_data_customer <- full_data_customer 
%>%
  left_join(avg_days_per_customer, by = "CUSTOMER_NUMBER")

# Remove temporary variables
rm(avg_days_per_customer)
```

```{r}
# Count the number of unique customers in each days-between-orders bin without adding a new column to the dataset
customers_by_bin <- full_data_customer %>%
  mutate(DAYS_BETWEEN_ORD_BIN = case_when(
    AVG_DAYS_BET_ORD >= 1 & AVG_DAYS_BET_ORD <= 10 ~ "1-10 days",
    AVG_DAYS_BET_ORD > 10 & AVG_DAYS_BET_ORD <= 20 ~ "11-20 days",
    AVG_DAYS_BET_ORD > 20 & AVG_DAYS_BET_ORD <= 30 ~ "21-30 days",
    AVG_DAYS_BET_ORD > 30 & AVG_DAYS_BET_ORD <= 40 ~ "31-40 days", # upper bound corrected (was 33, which left 34-40 unbinned)
    AVG_DAYS_BET_ORD > 40 & AVG_DAYS_BET_ORD <= 50 ~ "41-50 days",
    AVG_DAYS_BET_ORD > 50 ~ "Above 50 days",
    TRUE ~ "One Order Only"
  )) %>%
  group_by(DAYS_BETWEEN_ORD_BIN) %>%
  summarise(unique_customers = n_distinct(CUSTOMER_NUMBER), .groups = "drop") %>%
  mutate(percentage_customers = unique_customers / sum(unique_customers) * 100) %>% # Calculate percentage
  arrange(DAYS_BETWEEN_ORD_BIN)

# Create a bar plot resembling a histogram of the percentage of unique customers per days-between-orders bin
ggplot(customers_by_bin, aes(x = DAYS_BETWEEN_ORD_BIN, y = percentage_customers, fill = DAYS_BETWEEN_ORD_BIN)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = scales::percent(percentage_customers / 100)), vjust = -0.3, size = 3) + # Add percentage labels above bars
  scale_fill_brewer(palette = "Set3") + # Use RColorBrewer's Set3 palette
  labs(title = "Percentage of Unique Customers by Days Between Orders",
       x = "Days Between Orders",
       y = "Percentage of Unique Customers") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(hjust = 0.5, vjust = 0.5), # Centered x-axis labels without rotation
    panel.grid.major.x = element_blank(), # Remove vertical grid lines
    panel.grid.minor.x = element_blank(), # Remove minor vertical grid lines
    axis.text = element_text(size = 9),   # Set the size of axis labels
    axis.title = element_text(size = 10)  # Set the size of axis titles
  )

# Remove unnecessary intermediate data frames
rm(customers_by_bin)
```

Around 20% of customers had an average order interval of
up to 10 days, while 44% showed an average interval of more than 30 days.Approximately 5% of customers placed only one order, making it impossible to calculate the number of days between orders.#### 5.2.2 Recency - Time Since Last OrderTo calculate recency, I will consider the number of days between the date of the last order and 01-01-2025.```{r}# Create the LAST_ORDER_DATE column, excluding rows where all specified columns are zerofull_data <- full_data %>%group_by(CUSTOMER_NUMBER) %>%mutate(LAST_ORDER_DATE =if_else( (ORDERED_CASES >0| ORDERED_GALLONS >0) &!(ORDERED_CASES ==0& ORDERED_GALLONS ==0& LOADED_CASES ==0& LOADED_GALLONS ==0& DELIVERED_CASES ==0& DELIVERED_GALLONS ==0),as.character(max(TRANSACTION_DATE, na.rm =TRUE)), NA_character_ ) ) %>%ungroup()```There are 5,754 transaction rows where assigning the last transaction date based on orders is not possible. For these, the date of the last delivery operation will be used as the reference date. The last two transactions, referring to return transactions, will be excluded.```{r, message=FALSE, warning=FALSE}# For customers with LAST_ORDER_DATE as NA, consider the latest TRANSACTION_DATE where DELIVERED_CASES or DELIVERED_GALLONS > 0full_data <- full_data %>%mutate(LAST_ORDER_DATE =as.Date(LAST_ORDER_DATE)) %>%# Convert LAST_ORDER_DATE to Date formatgroup_by(CUSTOMER_NUMBER) %>%mutate(LAST_ORDER_DATE =if_else(is.na(LAST_ORDER_DATE) & (ORDERED_CASES ==0& ORDERED_GALLONS ==0),as.Date(max(TRANSACTION_DATE[DELIVERED_CASES >0| DELIVERED_GALLONS >0], na.rm =TRUE)), LAST_ORDER_DATE ) ) %>%ungroup()# Remove the last 2 rows where LAST_ORDER_DATE is NA (return operations only)full_data <- full_data %>%filter(!is.na(LAST_ORDER_DATE))# Remove rows where LAST_ORDER_DATE is Inf (return operations only)full_data <- full_data %>%filter(!is.infinite(LAST_ORDER_DATE))# Reference Datereference_date <-as.Date("2025-01-01")# Create the DAYS_AF_LAST_ORD column in full_datafull_data <- full_data %>%mutate(DAYS_AF_LAST_ORD 
=ifelse(!is.na(LAST_ORDER_DATE), as.numeric(difftime(reference_date, LAST_ORDER_DATE, units ="days")),NA_real_))# Aggregate full_data to get the latest LAST_ORDER_DATE and DAYS_AF_LAST_ORD for each CUSTOMER_NUMBERfull_data_aggregated <- full_data %>%group_by(CUSTOMER_NUMBER) %>%summarise(LAST_ORDER_DATE =max(LAST_ORDER_DATE, na.rm =TRUE),DAYS_AF_LAST_ORD =max(DAYS_AF_LAST_ORD, na.rm =TRUE),.groups ='drop' )# Join the aggregated data with full_data_customerfull_data_customer <- full_data_customer %>%left_join(full_data_aggregated, by ="CUSTOMER_NUMBER")# # Remove unnecessary intermediate data framesrm(full_data_aggregated)```#### 5.2.3 Total Quantity OrderedAs there is no access to the prices charged, and considering that they likely vary among customer types and demanded volumes, the focus will be on the quantities demanded instead of monetary values. This approach aligns with the current objective of customer segmentation.```{r}# Calculate the total ordered by customer by summing ORDERED_CASES and ORDERED_GALLONStotal_ordered_per_customer <- full_data %>%group_by(CUSTOMER_NUMBER) %>%summarise(TOTAL_ORDERED =sum(ORDERED_CASES, na.rm =TRUE) +sum(ORDERED_GALLONS, na.rm =TRUE)) %>%ungroup()# Add the TOTAL_ORDERED column to full_data_customer by CUSTOMER_NUMBERfull_data_customer <- full_data_customer %>%left_join(total_ordered_per_customer, by ="CUSTOMER_NUMBER")# Identify customers with TOTAL_ORDERED == 0customers_with_zero_ordered <- total_ordered_per_customer %>%filter(TOTAL_ORDERED ==0)# For those customers, calculate DELIVERED_CASES + DELIVERED_GALLONS from full_datadeliveries_for_zero_orders <- full_data %>%filter(CUSTOMER_NUMBER %in% customers_with_zero_ordered$CUSTOMER_NUMBER) %>%group_by(CUSTOMER_NUMBER) %>%summarise(DELIVERED_TOTAL =sum(DELIVERED_CASES, na.rm =TRUE) +sum(DELIVERED_GALLONS, na.rm =TRUE)) %>%ungroup()# Merge the delivery values into the total_ordered_per_customer dataframe,# ensuring that if TOTAL_ORDERED is zero, it is replaced by 
DELIVERED_TOTALtotal_ordered_per_customer <- total_ordered_per_customer %>%left_join(deliveries_for_zero_orders, by ="CUSTOMER_NUMBER") %>%mutate(TOTAL_ORDERED =if_else(TOTAL_ORDERED ==0, DELIVERED_TOTAL, TOTAL_ORDERED) ) %>% dplyr::select(CUSTOMER_NUMBER, TOTAL_ORDERED)# Add the updated TOTAL_ORDERED column to full_data_customer by CUSTOMER_NUMBERfull_data_customer <- full_data_customer %>%left_join(total_ordered_per_customer, by ="CUSTOMER_NUMBER")# Remove the 'TOTAL_ORDERED.x' column and rename 'TOTAL_ORDERED.y' to 'TOTAL_ORDERED'full_data_customer <- full_data_customer %>% dplyr::select(-TOTAL_ORDERED.x) %>% dplyr::rename(TOTAL_ORDERED = TOTAL_ORDERED.y)# Remove unnecessary intermediate data framesrm(total_ordered_per_customer, customers_with_zero_ordered, deliveries_for_zero_orders)```#### 5.2.4 Adapted RFM ScoreScores were assigned to classes based on the distribution of the created variables. The total score, combined with its relative weight, formed the *RFM_SCORE*, which served as an additional variable for customer analysis and segmentation.To define these scores, the quantitative distribution of each variable was used, especially considering the wide range observed in some of them. Each variable received a score from 1 to 10. In the case of frequency, two separate variables were created, and weight was given not only to the number of orders but also to the interval between them. 
As a result, the total score ranged from 4 to 40.```{r}# Remove previously created columns#full_data_customer <- full_data_customer %>%# dplyr::select(-FREQUENCY_SCORE, -RECENCY_SCORE, -VOLUME_SCORE, -RFM_SCORE)# Create Frequency Score based on NUM_ORDERSfull_data_customer <- full_data_customer %>%mutate(ORDER_FREQUENCY_SCORE =case_when( NUM_ORDERS >=300~10, NUM_ORDERS >=200~9, NUM_ORDERS >=150~8, NUM_ORDERS >=100~7, NUM_ORDERS >=75~6, NUM_ORDERS >=50~5, # 3rd quartile NUM_ORDERS >=35~4, # Mean NUM_ORDERS >=23~3, # Median NUM_ORDERS >=10~2, # 1st quartileTRUE~1 ),ORDER_INTERVAL_SCORE =case_when( AVG_DAYS_BET_ORD <=5~10, AVG_DAYS_BET_ORD <=13~9, # 1st quartile AVG_DAYS_BET_ORD <=20~8, AVG_DAYS_BET_ORD <=26~7, # Median AVG_DAYS_BET_ORD <=30~6, AVG_DAYS_BET_ORD <=50~5, AVG_DAYS_BET_ORD <=100~4, AVG_DAYS_BET_ORD <=210~3, # Mean AVG_DAYS_BET_ORD <=300~2,TRUE~1 ) )# Create Recency Score based on DAYS_AF_LAST_ORDfull_data_customer <- full_data_customer %>%mutate(RECENCY_SCORE =case_when( DAYS_AF_LAST_ORD <=7~10, DAYS_AF_LAST_ORD <=13~9, # 1st quartile DAYS_AF_LAST_ORD <=20~8, DAYS_AF_LAST_ORD <=27~7, #Median DAYS_AF_LAST_ORD <=40~6, DAYS_AF_LAST_ORD <=50~5, DAYS_AF_LAST_ORD <=72~4, #Mean DAYS_AF_LAST_ORD <=90~3, #3rd quartile DAYS_AF_LAST_ORD <=180~2, #Six monthsTRUE~1 ) )# Create Volume Score based on TOTAL_ORDEREDfull_data_customer <- full_data_customer %>%mutate(VOLUME_SCORE =case_when( TOTAL_ORDERED >=300000~10, TOTAL_ORDERED >=100000~9, TOTAL_ORDERED >=5000~8, TOTAL_ORDERED >=2000~7, TOTAL_ORDERED >=1267~6, # Mean TOTAL_ORDERED >=815~5, # 3rd quartile TOTAL_ORDERED >=400~4, # Threshold TOTAL_ORDERED >=302~3, # Median TOTAL_ORDERED >=200~2, TRUE~1 ) )# Calculate the overall RFM Score as the sum of Recency, Frequency, Order Interval, and Volume scoresfull_data_customer <- full_data_customer %>%mutate(RFM_SCORE = RECENCY_SCORE + ORDER_FREQUENCY_SCORE + ORDER_INTERVAL_SCORE + VOLUME_SCORE )# Count the number of customers in each RFM_SCORE rangerfm_distribution <- 
full_data_customer %>%mutate(RFM_CATEGORY =case_when( RFM_SCORE <=10~"4-10", RFM_SCORE <=20~"11-20", RFM_SCORE <=30~"21-30",TRUE~"31-40" )) %>%group_by(RFM_CATEGORY) %>%summarise(CUSTOMER_COUNT =n(), .groups ="drop") %>%mutate(PERCENTAGE = CUSTOMER_COUNT /sum(CUSTOMER_COUNT) *100)# Reorder RFM_CATEGORY to ensure it starts with scores between 4 and 10rfm_distribution$RFM_CATEGORY <-factor(rfm_distribution$RFM_CATEGORY, levels =c("4-10", "11-20", "21-30", "31-40"))# Plot the distribution of RFM scoresggplot(rfm_distribution, aes(x = RFM_CATEGORY, y = PERCENTAGE, fill = RFM_CATEGORY)) +geom_bar(stat ="identity", show.legend =FALSE) +geom_text(aes(label =paste0(round(PERCENTAGE, 1), "%")), vjust =-0.3, size =4) +scale_fill_brewer(palette ="Set3") +# Use Set3 color palettelabs(title ="Distribution of Customers by RFM Score",x ="RFM Score Range",y ="Percentage of Customers") +theme_minimal() +theme(axis.text.x =element_text(hjust =0.5),panel.grid.major.x =element_blank(),panel.grid.minor.x =element_blank(),axis.text =element_text(size =10),axis.title =element_text(size =11) )# Remove unnecessary intermediate data framerm(rfm_distribution)```The adapted RFM Score is a method developed to condense various pieces of information related to store consumption. It was observed that 60% of stores have a score up to 20 (the median), 32% have scores between 21-30, and 8.5% have scores above 30. 
This suggests that only a small percentage of stores exhibit high consumption patterns.```{r}# Filter only customers where LOCAL_FOUNT_ONLY == 1rfm_distribution_lfo <- full_data_customer %>%filter(LOCAL_FOUNT_ONLY ==1) %>%mutate(RFM_CATEGORY =case_when( RFM_SCORE <=10~"4-10", RFM_SCORE <=20~"11-20", RFM_SCORE <=30~"21-30",TRUE~"31-40" )) %>%group_by(RFM_CATEGORY) %>%summarise(CUSTOMER_COUNT =n(), .groups ="drop") %>%mutate(PERCENTAGE = CUSTOMER_COUNT /sum(CUSTOMER_COUNT) *100)# Reorder RFM_CATEGORY to ensure it starts with scores between 4 and 10rfm_distribution_lfo$RFM_CATEGORY <-factor(rfm_distribution_lfo$RFM_CATEGORY, levels =c("4-10", "11-20", "21-30", "31-40"))# Plot the distribution of RFM scores for LOCAL_FOUNT_ONLY == 1ggplot(rfm_distribution_lfo, aes(x = RFM_CATEGORY, y = PERCENTAGE, fill = RFM_CATEGORY)) +geom_bar(stat ="identity", show.legend =FALSE) +geom_text(aes(label =paste0(round(PERCENTAGE, 1), "%")), vjust =-0.3, size =4) +scale_fill_brewer(palette ="Set3") +# Use Set3 color palettelabs(title ="Distribution of Customers by RFM Score (LOCAL_FOUNT_ONLY = 1)",x ="RFM Score Range",y ="Percentage of Customers") +theme_minimal() +theme(axis.text.x =element_text(hjust =0.5),panel.grid.major.x =element_blank(),panel.grid.minor.x =element_blank(),axis.text =element_text(size =10),axis.title =element_text(size =11) )# Remove unnecessary intermediate data framerm(rfm_distribution_lfo)```For customers who are local partners and consume only fountain drinks, it is clear that their consumption patterns are even lower. Nearly 74% of them have scores up to 20, and among the remaining customers, less than 3.6% have scores above 30.### 5.3 Customer Demand and Growth#### 5.3.1 Low Demand CustomersIt is known that a few customers exhibit very high consumption volumes, causing the average to be skewed above the median. 
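The skew between mean and median can be illustrated with a toy vector of hypothetical customer volumes (illustrative values only, not taken from the client data): a single very large customer pulls the mean far above what a typical customer orders, which is why a quartile-based cut-off is more robust than a mean-based one.

```{r}
# Hypothetical yearly volumes for six customers; one outlier dominates the mean
volumes <- c(5, 10, 12, 15, 20, 5000)

mean(volumes)            # ~843.7 - pulled far above most customers by the outlier
median(volumes)          # 13.5 - a better picture of the typical customer
quantile(volumes, 0.25)  # first-quartile cut-off, insensitive to the outlier
```

This is the motivation for flagging low-demand customers relative to the first quartile rather than the mean.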
The table below explores metrics related to customers whose demand falls below the first quartile.```{r}# Summarize the metricsdata_summary <- full_data_customer %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(Avg_Vol_Cust =round(mean((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), na.rm =TRUE)),Median_Vol_Cust =round(median((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), na.rm =TRUE)),First_Quartile_Vol =round(quantile((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), 0.25, na.rm =TRUE)),.groups ='drop' )# Calculate the first quartile for each channelquartile_data <- full_data_customer %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(First_Quartile_Val =round(quantile((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), 0.25, na.rm =TRUE)),Tot_Vol =sum((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024), na.rm =TRUE),.groups ='drop' )# Calculate the number of customers below the first quartile and their total volumebelow_quartile_stats <- full_data_customer %>%left_join(quartile_data %>% dplyr::select(COLD_DRINK_CHANNEL, First_Quartile_Val), by ="COLD_DRINK_CHANNEL") %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(Num_Customers_Below_1Q =sum(((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024)) <= First_Quartile_Val, na.rm =TRUE),Vol_Below_1Q =sum(((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024))[((QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024)) <= First_Quartile_Val], na.rm =TRUE),.groups ='drop' )# Count the total number of customers per channelcustomer_count <- full_data_customer %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(N_Cust =n(), .groups ='drop')# Combine all the datafinal_summary <- customer_count %>%left_join(data_summary, by ="COLD_DRINK_CHANNEL") %>%left_join(quartile_data, by ="COLD_DRINK_CHANNEL") %>%left_join(below_quartile_stats, by 
="COLD_DRINK_CHANNEL") %>%mutate(Vol_Perct =round((Vol_Below_1Q / Tot_Vol) *100, 1),First_Quartile_Vol =as.integer(First_Quartile_Val) ) %>% dplyr::select(COLD_DRINK_CHANNEL, N_Cust, Avg_Vol_Cust, Median_Vol_Cust, First_Quartile_Vol, Num_Customers_Below_1Q, Vol_Perct)# Display the table with kable and stylingkable(final_summary, format ="html", escape =FALSE, align ="c", col.names =c("Cold Drink Channel", "Total Cust.", "Avg. Vol Cust.", "Median Vol Cust.", "1st Quartile Qtd", "Cust. Below 1st Quart", "Vol % Below 1st Quart")) %>%kable_styling(full_width =FALSE, position ="center") %>%column_spec(1, bold =TRUE) %>%column_spec(2:7, width ="6em") %>%row_spec(0, bold =TRUE, color ="black", background ="lightyellow") %>%add_header_above(c("Customers Analysis by Cold Drink Channel"=7)) %>%kable_paper("striped", full_width =FALSE)```For customers with total consumption volumes in 2023 and 2024 below the first quartile, the sums represent very low percentages, ranging from 0.3% to 3.1% of the total for each segment. In the dining segment, for example, 25% of customers showed demand below the first quartile.Some of these customers have been identified as having high growth potential, as their demand growth is above average. This happens because any increase in demand from these low-volume customers results in higher growth percentages.The low RFM scores also indicate that these customers have low recency, frequency, and total volume of purchases. Therefore, a flag, **LOW_DEMAND_CUST**, will be created, where a value of 1 will indicate low-consumption customers. 
With this flag, a white truck will be assigned to these customers, regardless of their growth indices. Below are the cut-off volumes by segment:

```{r}
# Extract the list of 'Cold Drink Channel' and '1st Quartile Qty'
list_summary <- final_summary %>%
  dplyr::select(COLD_DRINK_CHANNEL, First_Quartile_Vol) %>%
  deframe()

# Display the list
list_summary

# Calculate the sum per row and assign LOW_DEMAND_CUST
# Note: COLD_DRINK_CHANNEL is a factor, so index the named vector by name;
# indexing with the factor directly would use its integer codes instead
full_data_customer <- full_data_customer %>%
  mutate(
    Total_Vol_Cust = (QTD_DLV_CA_2023 + QTD_DLV_CA_2024) + (QTD_DLV_GAL_2023 + QTD_DLV_GAL_2024),
    LOW_DEMAND_CUST = if_else(Total_Vol_Cust <= list_summary[as.character(COLD_DRINK_CHANNEL)], 1, 0)
  )
```

In the plot below, the labels show the percentage and the number of customers that received this flag.

```{r}
# Group and calculate the number of customers with LOW_DEMAND_CUST by LOCAL_FOUNT_ONLY
summary_low_demand <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY, LOW_DEMAND_CUST) %>%
  summarise(
    total_customers = n(),
    .groups = "drop"
  )

# Calculate the percentage for each group
summary_low_demand <- summary_low_demand %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(
    percentage = total_customers / sum(total_customers) * 100
  )

# Plot percentages with LOW_DEMAND_CUST as fill and LOCAL_FOUNT_ONLY as groups
ggplot(summary_low_demand, aes(x = factor(LOCAL_FOUNT_ONLY), y = percentage, fill = factor(LOW_DEMAND_CUST))) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = paste0(scales::comma(percentage, suffix = "%"), " (", total_customers, ")")),
            position = position_dodge(width = 0.8), vjust = -0.2, size = 3.5) +
  labs(title = "Percentage of Customers with Low Demand") +
  scale_fill_manual(values = c("0" = "darkolivegreen", "1" = "sandybrown"),
                    labels = c("0" = "Others (Above Q1)", "1" = "Low Demand")) + # Set colors and labels for LOW_DEMAND_CUST
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.title = element_text(face = "bold", size = 10), # Add legend title
    legend.position = "right", # Position legend on the right side
    legend.box = "vertical",   # Ensure vertical arrangement for the legend
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 10, face = "bold"),
    strip.background = element_blank(),
    axis.text.x = element_text(size = 10, angle = 0), # Display x-axis labels without rotation
    panel.spacing = unit(1, "lines"),
    strip.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) + # Set x-axis labels
  guides(fill = guide_legend(title = "Low Demand Status")) # Add legend title
```

#### 5.3.2 Demand Variation Across All Stores

To measure demand growth patterns across the customer base (January 2023 - December 2024):

1. **Data Preparation:** Combined monthly case and gallon deliveries for each customer into total monthly volumes.
2. **Eligibility:** Required ≥6 months with positive orders for reliable analysis. Customers with fewer than 6 ordering months were classified as having no growth potential (6,026 customers).
3. **Growth Calculation:**
   - Split each qualifying customer's order history into two equal time periods
   - For odd numbers of months, divided the middle month equally between periods
   - Calculated growth rate as: (Second Period Total - First Period Total) / First Period Total
4.
**Classification:** Customers with growth rates exceeding the average positive growth rate were categorized as high growth potential (HIGH_GROW_POT = 1), while all others received a standard classification (HIGH_GROW_POT = 0).```{r}# Initialize new columns in the datasetfull_data_customer$NUM_POSITIVE_SUMS <-0full_data_customer$QTD_DLV_FIRST_HALF <-0full_data_customer$QTD_DLV_SECOND_HALF <-0full_data_customer$DEMAND_VARIATION <-NA# Initialize as NA# Process each customer individuallyfor (i in1:nrow(full_data_customer)) {# Create a vector of positive sums while maintaining the chronological order POSITIVE_SUMS <-c()# Iterate over the 24 months in the correct sequencefor (j in1:24) {# Create column names year <-2023+ (j -1) %/%12 month <- (j -1) %%12+1 CA_COL <-paste0("QTD_DLV_CA_", sprintf("%04d", year), "_", sprintf("%02d", month)) GAL_COL <-paste0("QTD_DLV_GAL_", sprintf("%04d", year), "_", sprintf("%02d", month))# Check if columns exist in the datasetif (CA_COL %in%names(full_data_customer) && GAL_COL %in%names(full_data_customer)) { CA_VALUE <- full_data_customer[[CA_COL]][i] GAL_VALUE <- full_data_customer[[GAL_COL]][i]# Replace NA with 0 CA_VALUE <-ifelse(is.na(CA_VALUE), 0, CA_VALUE) GAL_VALUE <-ifelse(is.na(GAL_VALUE), 0, GAL_VALUE)# Sum values for the month SUM_VALUE <- CA_VALUE + GAL_VALUE# Add to the list if positiveif (SUM_VALUE >0) { POSITIVE_SUMS <-c(POSITIVE_SUMS, SUM_VALUE) } } }# Total number of positive operations num_operations <-length(POSITIVE_SUMS) full_data_customer$NUM_POSITIVE_SUMS[i] <- num_operations# If fewer than 6 positive sums, set values accordingly and continueif (num_operations <6) { full_data_customer$QTD_DLV_FIRST_HALF[i] <-0 full_data_customer$QTD_DLV_SECOND_HALF[i] <-0 full_data_customer$DEMAND_VARIATION[i] <-NAnext }# Initialize the two halves QTD_DLV_FIRST_HALF <-0 QTD_DLV_SECOND_HALF <-0# Split the operations into two halvesif (num_operations %%2==0) {# If even number of operations mid_point <- num_operations /2 
QTD_DLV_FIRST_HALF <-sum(POSITIVE_SUMS[1:mid_point]) QTD_DLV_SECOND_HALF <-sum(POSITIVE_SUMS[(mid_point +1):num_operations]) } else {# If odd number of operations mid_point <- (num_operations +1) %/%2# Split the middle value between both halves first_part <-if(mid_point >1) POSITIVE_SUMS[1:(mid_point -1)] elsenumeric(0) central_value <- POSITIVE_SUMS[mid_point] /2 second_part <-if(mid_point < num_operations) POSITIVE_SUMS[(mid_point +1):num_operations] elsenumeric(0) QTD_DLV_FIRST_HALF <-sum(c(first_part, central_value)) QTD_DLV_SECOND_HALF <-sum(c(central_value, second_part)) }# Assign values to the dataset full_data_customer$QTD_DLV_FIRST_HALF[i] <- QTD_DLV_FIRST_HALF full_data_customer$QTD_DLV_SECOND_HALF[i] <- QTD_DLV_SECOND_HALF# Calculate demand variationif (QTD_DLV_FIRST_HALF >0) { # Avoid division by zero DEMAND_VARIATION_VALUE <- (QTD_DLV_SECOND_HALF - QTD_DLV_FIRST_HALF) / QTD_DLV_FIRST_HALF full_data_customer$DEMAND_VARIATION[i] <- DEMAND_VARIATION_VALUE } else { full_data_customer$DEMAND_VARIATION[i] <-NA }}# Create the HIGH_GROW_POT columnfull_data_customer$HIGH_GROW_POT <-0# Initialize all values to 0# Calculate the mean of DEMAND_VARIATION for positive values onlypositive_variations <- full_data_customer$DEMAND_VARIATION[full_data_customer$DEMAND_VARIATION >0]if (length(positive_variations) >0) { mean_value <-mean(positive_variations, na.rm =TRUE)# Display the calculated meancat("Calculated mean of positive DEMAND_VARIATION: ", mean_value, "\n")# Assign 1 for customers with DEMAND_VARIATION greater than the mean full_data_customer$HIGH_GROW_POT <-ifelse(!is.na(full_data_customer$DEMAND_VARIATION) & full_data_customer$DEMAND_VARIATION > mean_value, 1, full_data_customer$HIGH_GROW_POT)} else {cat("No positive DEMAND_VARIATION values found\n")}```Considering all customers, there was an average demand growth variation of 28%. 
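As a minimal illustration of the period-splitting and growth-rate rule, here is the odd-month case on toy values (hypothetical volumes, not from the dataset):

```{r}
# Toy example: five positive monthly sums, middle month shared between halves
vols <- c(100, 120, 80, 150, 160)
n    <- length(vols)
mid  <- (n + 1) %/% 2                                    # middle index: 3
first_half  <- sum(vols[1:(mid - 1)]) + vols[mid] / 2    # 100 + 120 + 40 = 260
second_half <- vols[mid] / 2 + sum(vols[(mid + 1):n])    # 40 + 150 + 160 = 350
growth <- (second_half - first_half) / first_half        # (350 - 260) / 260 ~ 0.346
```

Splitting the middle month keeps both halves covering the same effective number of months, so the growth rate is not biased toward either period.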
However, 6,026 customers were excluded from the analysis because their growth could not be calculated, as they had fewer than 6 months with orders. These customers were assumed to have no growth potential.

Below is the number of customers whose growth exceeded the average, regardless of segment.

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# CHECK CODE: verify the split logic for a single customer
customer_id <- 600076539  # 600079420 (not eligible)
customer_data <- full_data_customer[full_data_customer$CUSTOMER_NUMBER == customer_id, ]

# Create a vector of positive sums while maintaining the chronological order
POSITIVE_SUMS <- c()
COLUMN_NAMES <- c()  # Vector to store the column names used

# Iterate over the 24 months in the correct sequence
for (j in 1:24) {
  # Build the column names for this month
  CA_COL <- paste0("QTD_DLV_CA_", sprintf("%04d", 2023 + (j - 1) %/% 12), "_", sprintf("%02d", (j - 1) %% 12 + 1))
  GAL_COL <- paste0("QTD_DLV_GAL_", sprintf("%04d", 2023 + (j - 1) %/% 12), "_", sprintf("%02d", (j - 1) %% 12 + 1))

  # Check if the columns exist in the dataset
  if (CA_COL %in% names(full_data_customer) && GAL_COL %in% names(full_data_customer)) {
    CA_VALUE <- customer_data[[CA_COL]]
    GAL_VALUE <- customer_data[[GAL_COL]]

    # Replace NA with 0
    CA_VALUE <- ifelse(is.na(CA_VALUE), 0, CA_VALUE)
    GAL_VALUE <- ifelse(is.na(GAL_VALUE), 0, GAL_VALUE)

    # Sum values for the month; keep positives in their natural order
    SUM_VALUE <- CA_VALUE + GAL_VALUE
    if (SUM_VALUE > 0) {
      POSITIVE_SUMS <- c(POSITIVE_SUMS, SUM_VALUE)
      COLUMN_NAMES <- c(COLUMN_NAMES, paste(CA_COL, "+", GAL_COL))  # Save column names used
    }
  }
}

# Total number of positive operations
num_operations <- length(POSITIVE_SUMS)

# Initialize the two halves
FIRST_HALF <- c()
SECOND_HALF <- c()
FIRST_HALF_NAMES <- c()
SECOND_HALF_NAMES <- c()

if (num_operations %% 2 == 0) {
  # Even number of operations: split into two equal halves
  mid_point <- num_operations / 2
  FIRST_HALF <- POSITIVE_SUMS[1:mid_point]
  SECOND_HALF <- POSITIVE_SUMS[(mid_point + 1):num_operations]
  FIRST_HALF_NAMES <- COLUMN_NAMES[1:mid_point]
  SECOND_HALF_NAMES <- COLUMN_NAMES[(mid_point + 1):num_operations]
} else {
  # Odd number of operations: split the central sum between the halves
  mid_point <- (num_operations + 1) %/% 2
  FIRST_HALF <- POSITIVE_SUMS[1:(mid_point - 1)]
  FIRST_HALF_NAMES <- COLUMN_NAMES[1:(mid_point - 1)]
  central_value <- POSITIVE_SUMS[mid_point] / 2
  FIRST_HALF <- c(FIRST_HALF, central_value)
  SECOND_HALF <- c(central_value, POSITIVE_SUMS[(mid_point + 1):num_operations])
  FIRST_HALF_NAMES <- c(FIRST_HALF_NAMES, COLUMN_NAMES[mid_point])
  SECOND_HALF_NAMES <- c(COLUMN_NAMES[mid_point], COLUMN_NAMES[(mid_point + 1):num_operations])
}

# Display the values that make up each half, with the columns used
cat("\n--- Values that make up FIRST_HALF ---\n")
for (i in 1:length(FIRST_HALF)) {
  cat(FIRST_HALF_NAMES[i], " = ", FIRST_HALF[i], "\n")
}
cat("\n--- Values that make up SECOND_HALF ---\n")
for (i in 1:length(SECOND_HALF)) {
  cat(SECOND_HALF_NAMES[i], " = ", SECOND_HALF[i], "\n")
}

# Display the sum of the two halves
cat("\nSum of FIRST_HALF:", sum(FIRST_HALF))
cat("\nSum of SECOND_HALF:", sum(SECOND_HALF))

# Calculate the percentage change between FIRST_HALF and SECOND_HALF
if (sum(FIRST_HALF) == 0) {
  cat("\nPercentage Change: 0 (because FIRST_HALF is 0)\n")
} else {
  percentage_change <- (sum(SECOND_HALF) - sum(FIRST_HALF)) / sum(FIRST_HALF) * 100
  cat("\nPercentage Change: ", round(percentage_change, 2), "%\n")
}

# Display the number of positive operations
cat("\nNumber of positive operations: ", num_operations, "\n")
```

```{r}
# Calculate the percentage of customers with HIGH_GROW_POT = 1 and 0 by LFO
summary_high_growth <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    high_growth = sum(HIGH_GROW_POT == 1, na.rm = TRUE),
    low_growth = sum(HIGH_GROW_POT == 0, na.rm = TRUE),
    total_customers = n(),
    .groups = "drop"
  ) %>%
  mutate(
    pct_high_growth = high_growth / total_customers * 100,
    pct_low_growth = low_growth / total_customers * 100
  )

# Transform data into long format for plotting
summary_high_growth_long <- summary_high_growth %>%
  pivot_longer(
    cols = starts_with("pct_"),
    names_to = "growth_type",
    values_to = "percentage"
  ) %>%
  mutate(
    growth_type = factor(growth_type,
                         levels = c("pct_low_growth", "pct_high_growth"),
                         labels = c("Low Growth Potential", "High Growth Potential"))
  )

# Ensure LFO is a factor
summary_high_growth_long$LOCAL_FOUNT_ONLY <- factor(summary_high_growth_long$LOCAL_FOUNT_ONLY, levels = c("0", "1"))

# Plot percentages with the legend on the side
ggplot(summary_high_growth_long, aes(x = LOCAL_FOUNT_ONLY, y = percentage, fill = growth_type)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.6) +
  geom_text(aes(label = scales::comma(percentage, suffix = "%")),
            position = position_dodge(width = 0.8), vjust = 0.2, size = 3.5) +
  labs(title = "Percentage of Customers Classified as Low or High Growth Potential") +
  scale_fill_manual(values = c("Low Growth Potential" = "#FF6347", "High Growth Potential" = "#40E0D0")) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_blank(),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    legend.title = element_blank(),
    legend.position = "right",   # Position legend on the right side
    legend.box = "vertical",     # Ensure vertical arrangement for the legend
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    strip.text = element_text(size = 10, face = "bold"),
    strip.background = element_blank(),
    axis.text.x = element_text(size = 10),
    panel.spacing = unit(1, "lines"),
    strip.text.y = element_blank(),
    axis.ticks.y = element_blank()
  ) +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  guides(fill = guide_legend(title = "Growth Potential"))  # Add a legend title

# Count customers with HIGH_GROW_POT = 1 and 0 by LFO
summary_high_growth <- full_data_customer %>%
  group_by(LOCAL_FOUNT_ONLY) %>%
  summarise(
    low_growth = sum(HIGH_GROW_POT == 0, na.rm = TRUE),
    high_growth = sum(HIGH_GROW_POT == 1, na.rm = TRUE),
    .groups = "drop"
  )

# Display the summary with the count of customers
# summary_high_growth
```

Approximately 9% (123) of customers identified as local market partners who purchase fountain-only products show high growth potential according to the established criteria. Among other customers, about 12% (3,450) are classified as having high growth potential.

Customers with high volumes are somewhat penalized by this criterion, as significant demand growth is harder to achieve from a large base. However, their substantial volume already makes them strategic partners, warranting close monitoring and prioritized deliveries via red trucks. For these customers, lower distribution costs allow for more competitive pricing, supporting the long-term sustainability of the partnership.

#### 5.3.3 Demand Variation by Cold Drink Channel

Each customer's growth potential was also evaluated within their respective segment.
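The within-segment rule can be sketched in base R on toy data (hypothetical channels and values; the analysis itself applies the same idea with dplyr on `full_data_customer`):

```{r}
# Flag rows whose variation exceeds the mean of their own channel,
# treating missing variations as "no growth potential" (flag = 0)
toy <- data.frame(
  channel   = c("DINING", "DINING", "GOODS", "GOODS"),
  variation = c(0.10, 0.30, NA, 0.50)
)
ch_mean  <- ave(toy$variation, toy$channel, FUN = function(x) mean(x, na.rm = TRUE))
toy$flag <- ifelse(!is.na(toy$variation) & toy$variation > ch_mean, 1, 0)
toy$flag  # only the second DINING row exceeds its channel mean
```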
Following the same criteria as before, only customers whose demand variation exceeded their segment's average were classified as high potential. Below is the calculated demand variation for each Cold Drink Channel during the period.

```{r}
# Define a custom color palette for COLD_DRINK_CHANNEL with unique colors
cold_drink_channel_colors <- c(
  "DINING" = "#A7ADC6", "PUBLIC SECTOR" = "#FF6347", "EVENT" = "#B33951",
  "WORKPLACE" = "#ABD2FA", "ACCOMMODATION" = "#E377C2", "GOODS" = "#FFD700",
  "BULK TRADE" = "#8ED081", "WELLNESS" = "#20B2AA", "CONVENTIONAL" = "#1F77B4"
)

# Aggregate data: mean DEMAND_VARIATION by channel
summary_growth_channel <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(CHANNEL_VAR = mean(DEMAND_VARIATION, na.rm = TRUE))

# Create a horizontal bar chart
ggplot(summary_growth_channel, aes(x = CHANNEL_VAR, y = reorder(COLD_DRINK_CHANNEL, CHANNEL_VAR), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", alpha = 0.6) +
  geom_text(aes(label = paste0(round(CHANNEL_VAR * 100, 1), "%")), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Average Demand Variation by Cold Drink Channel",
       x = "Percentage Variation (%)", y = NULL) +
  scale_x_continuous(labels = scales::label_percent(accuracy = 0.1), expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(size = 10),
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray", size = 0.5)
  )
```

Dining and bulk trade are the most important channels, with customers increasing their demand by 2.1% and 5.6% on average, respectively. Wellness showed the highest variation at almost 10%, but accounts for only 3.2% of the total volume sold. Goods had the second-highest variation, at 9%, and represents 10% of the total volume.

```{r}
# Calculate the mean DEMAND_VARIATION for each COLD_DRINK_CHANNEL
channel_means <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(MEAN_DEMAND_VARIATION = mean(DEMAND_VARIATION, na.rm = TRUE))

# Merge the mean values into full_data_customer
full_data_customer <- full_data_customer %>%
  left_join(channel_means, by = "COLD_DRINK_CHANNEL")

# Create the CHANNEL_GROWTH_POT column
full_data_customer$CHANNEL_GROWTH_POT <- ifelse(
  is.na(full_data_customer$DEMAND_VARIATION), 0,
  ifelse(full_data_customer$DEMAND_VARIATION > full_data_customer$MEAN_DEMAND_VARIATION, 1, 0)
)

# Remove the helper MEAN_DEMAND_VARIATION column
full_data_customer <- full_data_customer %>% dplyr::select(-MEAN_DEMAND_VARIATION)

# Calculate the percentage of customers with high growth potential by channel
summary_growth_channel_customers <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(pct_high_growth = mean(CHANNEL_GROWTH_POT == 1, na.rm = TRUE) * 100)

# Create the horizontal bar chart
ggplot(summary_growth_channel_customers, aes(x = pct_high_growth, y = reorder(COLD_DRINK_CHANNEL, pct_high_growth), fill = COLD_DRINK_CHANNEL)) +
  geom_bar(stat = "identity", alpha = 0.6) +
  geom_text(aes(label = paste0(round(pct_high_growth, 1), "%")), hjust = -0.01, color = "black", size = 3.2) +
  labs(title = "Percentage of Customers with High Growth Potential by Cold Drink Channel",
       x = "Percentage of Customers (%)", y = NULL) +
  scale_x_continuous(labels = scales::label_number(accuracy = 1), expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = cold_drink_channel_colors) +  # Reuse the palette defined above
  theme_minimal() +
  theme(
    plot.title = element_text(size = 10, face = "bold"),
    axis.text.y = element_text(size = 10),
    axis.title.x = element_text(size = 10),
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_line(color = "gray", size = 0.5)
  )
```

In most segments, more than 30% of stores grew above their group's average; only the EVENT segment had a lower share, close to 23%. These customers will be classified as high-growth within their respective segments.

Classifying customers against their channel's average significantly expands the high-potential customer base. Even when simulating a stricter threshold (growth at least 100% above the channel average), the base remained large. This criterion will therefore need further analysis before potentially being adopted.

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# Save full_data as CSV
# write.csv(full_data, file = "full_data.csv", row.names = FALSE)

# Save full_data_customer as CSV
# write.csv(full_data_customer, file = "full_data_customer.csv", row.names = FALSE)

# Load the full_data CSV
# full_data <- read.csv(file = "full_data.csv", sep = ",", stringsAsFactors = FALSE)

# Load the full_data_customer CSV
# full_data_customer <- read.csv(file = "full_data_customer.csv", sep = ",", stringsAsFactors = FALSE)
```

## 6. Correlations

**Customer Features X RFM_SCORE**

To understand how the variables correlate, and to obtain clear information without multicollinearity, we selected the numeric variables and display only the strongest correlations (disregarding the range between -0.2 and 0.2).

```{r, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# List of selected variables
selected_vars <- c("LOCAL_FOUNT_ONLY", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "CHAIN_MEMBER",
                   "DAYS_ONBOARDING", "DAYS_FIRST_DLV", "OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP",
                   "OT_MYCOKE.LEGACY", "OT_MYCOKE360", "OT_EDI", "RFM_SCORE", "HIGH_GROW_POT",
                   "CHANNEL_GROWTH_POT", "LOW_DEMAND_CUST", "TOTAL_COST_CA_GAL")

# Select only the numeric variables from the dataset
numeric_vars <- full_data_customer %>%
  dplyr::select(all_of(selected_vars)) %>%
  dplyr::select(where(is.numeric))

# Compute the correlation matrix (handling missing values)
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

# Replace NAs in the correlation matrix with 0 to avoid errors
cor_matrix[is.na(cor_matrix)] <- 0

# Set perfect correlations of 1 to NA to exclude them
cor_matrix[cor_matrix == 1] <- NA

# Convert the correlation matrix to long format
cor_df <- as.data.frame(cor_matrix) %>%
  rownames_to_column(var = "Variable1") %>%
  pivot_longer(cols = -Variable1, names_to = "Variable2", values_to = "Correlation") %>%
  filter(!is.na(Correlation)) %>%  # Remove NAs, which represent correlations of 1
  filter((Correlation >= 0.20 & Correlation <= 0.99) |
         (Correlation <= -0.20 & Correlation >= -0.99)) %>%  # Keep only correlations outside the -0.20 to 0.20 range
  mutate(Correlation = round(Correlation, 2)) %>%
  mutate(pair_id = paste0(pmin(Variable1, Variable2), "-", pmax(Variable1, Variable2))) %>%
  distinct(pair_id, .keep_all = TRUE) %>%  # Remove duplicate pairs (A-B, B-A)
  dplyr::select(Variable1, Variable2, Correlation) %>%
  arrange(desc(Correlation))  # Sort from highest to lowest correlation

# Display the correlation table in kable format with styling
cor_df %>%
  kable("html", col.names = c("Variable 1", "Variable 2", "Correlation")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:3, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "lightgray") %>%
  add_header_above(c("Correlations Between Selected Variables (beyond ±0.2)" = 3)) %>%
  kable_paper("striped", full_width = FALSE)
```

```{r}
# Compute the correlation matrix for the selected numeric variables
correlation_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")

# Replace NAs with 0 to avoid errors
correlation_matrix[is.na(correlation_matrix)] <- 0

# Visualize the correlation matrix with rotated labels
corrplot(correlation_matrix, method = "circle", type = "upper",
         tl.cex = 0.8, tl.col = "black", tl.srt = 45, number.cex = 0.6,
         diag = FALSE,  # Remove diagonal
         col = colorRampPalette(c("blue", "white", "red"))(200))  # Color palette
```

The strongest correlations were observed between Days Onboarding and Days After First Delivery (0.7) and between the order types MyCoke Legacy and MyCoke 360 (0.66). Both relationships make sense: customers who onboarded earlier tend to have older orders, except where a new store belongs to an established chain, and customers who previously used the legacy channel transitioned to the newer 360 platform. There is also a correlation of 0.53 between overall customer growth and growth within the Cold Drink Channel, suggesting that expansion trends align across segments.
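As a side note, the A-B/B-A de-duplication used when building the correlation table relies on an order-independent pair id built with `pmin()`/`pmax()`; a toy sketch (hypothetical variable names):

```{r}
# Each unordered pair gets the same id regardless of column order,
# so duplicated() keeps only one row per pair
pairs <- data.frame(
  Variable1   = c("A", "B", "A", "C"),
  Variable2   = c("B", "A", "C", "A"),
  Correlation = c(0.5, 0.5, -0.3, -0.3)
)
pairs$pair_id <- paste0(pmin(pairs$Variable1, pairs$Variable2), "-",
                        pmax(pairs$Variable1, pairs$Variable2))
deduped <- pairs[!duplicated(pairs$pair_id), c("Variable1", "Variable2", "Correlation")]
nrow(deduped)  # 2 unique pairs remain
```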
The RFM Score also correlates with several variables that were not directly used in its calculation, with correlations ranging from 0.27 to 0.44. Among the negative correlations, the most notable is between RFM Score and Low Demand Customer (-0.65), indicating that lower RFM scores effectively capture low-demand customers.

**Census X Total Ordered**

The 2023 updated census variables showed correlations close to zero with customers' consumption patterns. For this reason, these variables will be excluded, along with others no longer required, to streamline `full_data_customer`. However, the process is retained in this document, as the company may obtain different results when applying real locations.

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
full_data_customer <- full_data_customer %>%
  dplyr::select(-c(
    "TRANSACTIONS_DATE_COUNT", "DLVT_BOTH", "DLVT_CASES", "DLVT_GALLONS", "DLVT_ORDER_LOAD",
    "DLVT_RETURN_GALLONS", "DLVT_RETURN_CASES", "DLVT_RETURN_BOTH",
    "TOTAL_CASES_ORDERED", "TOTAL_CASES_DELIVERED", "TOTAL_GALLONS_ORDERED", "TOTAL_GALLONS_DELIVERED",
    "TOTAL_CASES_RETURNED", "TOTAL_GALLONS_RETURNED",
    "TRANS_DLV_CA_2023_01", "TRANS_DLV_CA_2023_02", "TRANS_DLV_CA_2023_03", "TRANS_DLV_CA_2023_04",
    "TRANS_DLV_CA_2023_05", "TRANS_DLV_CA_2023_06", "TRANS_DLV_CA_2023_07", "TRANS_DLV_CA_2023_08",
    "TRANS_DLV_CA_2023_09", "TRANS_DLV_CA_2023_10", "TRANS_DLV_CA_2023_11", "TRANS_DLV_CA_2023_12",
    "TRANS_DLV_CA_2024_01", "TRANS_DLV_CA_2024_02", "TRANS_DLV_CA_2024_03", "TRANS_DLV_CA_2024_04",
    "TRANS_DLV_CA_2024_05", "TRANS_DLV_CA_2024_06", "TRANS_DLV_CA_2024_07", "TRANS_DLV_CA_2024_08",
    "TRANS_DLV_CA_2024_09", "TRANS_DLV_CA_2024_10", "TRANS_DLV_CA_2024_11", "TRANS_DLV_CA_2024_12",
    "TRANS_DLV_GAL_2023_01", "TRANS_DLV_GAL_2023_02", "TRANS_DLV_GAL_2023_03", "TRANS_DLV_GAL_2023_04",
    "TRANS_DLV_GAL_2023_05", "TRANS_DLV_GAL_2023_06", "TRANS_DLV_GAL_2023_07", "TRANS_DLV_GAL_2023_08",
"TRANS_DLV_GAL_2023_09", "TRANS_DLV_GAL_2023_10", "TRANS_DLV_GAL_2023_11", "TRANS_DLV_GAL_2023_12", "TRANS_DLV_GAL_2024_01", "TRANS_DLV_GAL_2024_02", "TRANS_DLV_GAL_2024_03", "TRANS_DLV_GAL_2024_04", "TRANS_DLV_GAL_2024_05", "TRANS_DLV_GAL_2024_06", "TRANS_DLV_GAL_2024_07", "TRANS_DLV_GAL_2024_08", "TRANS_DLV_GAL_2024_09", "TRANS_DLV_GAL_2024_10", "TRANS_DLV_GAL_2024_11", "TRANS_DLV_GAL_2024_12", "TRANS_ORD_CA_2023_01", "TRANS_ORD_CA_2023_02", "TRANS_ORD_CA_2023_03", "TRANS_ORD_CA_2023_04", "TRANS_ORD_CA_2023_05", "TRANS_ORD_CA_2023_06", "TRANS_ORD_CA_2023_07", "TRANS_ORD_CA_2023_08", "TRANS_ORD_CA_2023_09", "TRANS_ORD_CA_2023_10", "TRANS_ORD_CA_2023_11", "TRANS_ORD_CA_2023_12", "TRANS_ORD_CA_2024_01", "TRANS_ORD_CA_2024_02", "TRANS_ORD_CA_2024_03", "TRANS_ORD_CA_2024_04", "TRANS_ORD_CA_2024_05", "TRANS_ORD_CA_2024_06", "TRANS_ORD_CA_2024_07", "TRANS_ORD_CA_2024_08", "TRANS_ORD_CA_2024_09", "TRANS_ORD_CA_2024_10", "TRANS_ORD_CA_2024_11", "TRANS_ORD_CA_2024_12", "TRANS_ORD_GAL_2023_01", "TRANS_ORD_GAL_2023_02", "TRANS_ORD_GAL_2023_03", "TRANS_ORD_GAL_2023_04", "TRANS_ORD_GAL_2023_05", "TRANS_ORD_GAL_2023_06", "TRANS_ORD_GAL_2023_07", "TRANS_ORD_GAL_2023_08", "TRANS_ORD_GAL_2023_09", "TRANS_ORD_GAL_2023_10", "TRANS_ORD_GAL_2023_11", "TRANS_ORD_GAL_2023_12", "TRANS_ORD_GAL_2024_01", "TRANS_ORD_GAL_2024_02", "TRANS_ORD_GAL_2024_03", "TRANS_ORD_GAL_2024_04", "TRANS_ORD_GAL_2024_05", "TRANS_ORD_GAL_2024_06", "TRANS_ORD_GAL_2024_07", "TRANS_ORD_GAL_2024_08", "TRANS_ORD_GAL_2024_09", "TRANS_ORD_GAL_2024_10", "TRANS_ORD_GAL_2024_11", "TRANS_ORD_GAL_2024_12", "TRANS_RET_CA_2023_01", "TRANS_RET_CA_2023_02", "TRANS_RET_CA_2023_03", "TRANS_RET_CA_2023_04", "TRANS_RET_CA_2023_05", "TRANS_RET_CA_2023_06", "TRANS_RET_CA_2023_07", "TRANS_RET_CA_2023_08", "TRANS_RET_CA_2023_09", "TRANS_RET_CA_2023_10", "TRANS_RET_CA_2023_11", "TRANS_RET_CA_2023_12", "TRANS_RET_CA_2024_01", "TRANS_RET_CA_2024_02", "TRANS_RET_CA_2024_03", "TRANS_RET_CA_2024_04", "TRANS_RET_CA_2024_05", 
"TRANS_RET_CA_2024_06", "TRANS_RET_CA_2024_07", "TRANS_RET_CA_2024_08", "TRANS_RET_CA_2024_09", "TRANS_RET_CA_2024_10", "TRANS_RET_CA_2024_11", "TRANS_RET_CA_2024_12", "TRANS_RET_GAL_2023_01", "TRANS_RET_GAL_2023_02", "TRANS_RET_GAL_2023_03", "TRANS_RET_GAL_2023_04", "TRANS_RET_GAL_2023_05", "TRANS_RET_GAL_2023_06", "TRANS_RET_GAL_2023_07", "TRANS_RET_GAL_2023_08", "TRANS_RET_GAL_2023_09", "TRANS_RET_GAL_2023_10", "TRANS_RET_GAL_2023_11", "TRANS_RET_GAL_2023_12", "TRANS_RET_GAL_2024_01", "TRANS_RET_GAL_2024_02", "TRANS_RET_GAL_2024_03", "TRANS_RET_GAL_2024_04", "TRANS_RET_GAL_2024_05", "TRANS_RET_GAL_2024_06", "TRANS_RET_GAL_2024_07", "TRANS_RET_GAL_2024_08", "TRANS_RET_GAL_2024_09", "TRANS_RET_GAL_2024_10", "TRANS_RET_GAL_2024_11", "TRANS_RET_GAL_2024_12", "QTD_DLV_CA_2023_01", "QTD_DLV_CA_2023_02", "QTD_DLV_CA_2023_03", "QTD_DLV_CA_2023_04", "QTD_DLV_CA_2023_05", "QTD_DLV_CA_2023_06", "QTD_DLV_CA_2023_07", "QTD_DLV_CA_2023_08", "QTD_DLV_CA_2023_09", "QTD_DLV_CA_2023_10", "QTD_DLV_CA_2023_11", "QTD_DLV_CA_2023_12", "QTD_DLV_CA_2024_01", "QTD_DLV_CA_2024_02", "QTD_DLV_CA_2024_03", "QTD_DLV_CA_2024_04", "QTD_DLV_CA_2024_05", "QTD_DLV_CA_2024_06", "QTD_DLV_CA_2024_07", "QTD_DLV_CA_2024_08", "QTD_DLV_CA_2024_09", "QTD_DLV_CA_2024_10", "QTD_DLV_CA_2024_11", "QTD_DLV_CA_2024_12", "QTD_DLV_GAL_2023_01", "QTD_DLV_GAL_2023_02", "QTD_DLV_GAL_2023_03", "QTD_DLV_GAL_2023_04", "QTD_DLV_GAL_2023_05", "QTD_DLV_GAL_2023_06", "QTD_DLV_GAL_2023_07", "QTD_DLV_GAL_2023_08", "QTD_DLV_GAL_2023_09", "QTD_DLV_GAL_2023_10", "QTD_DLV_GAL_2023_11", "QTD_DLV_GAL_2023_12", "QTD_DLV_GAL_2024_01", "QTD_DLV_GAL_2024_02", "QTD_DLV_GAL_2024_03", "QTD_DLV_GAL_2024_04", "QTD_DLV_GAL_2024_05", "QTD_DLV_GAL_2024_06", "QTD_DLV_GAL_2024_07", "QTD_DLV_GAL_2024_08", "QTD_DLV_GAL_2024_09", "QTD_DLV_GAL_2024_10", "QTD_DLV_GAL_2024_11", "QTD_DLV_GAL_2024_12", "QTD_ORD_CA_2023_01", "QTD_ORD_CA_2023_02", "QTD_ORD_CA_2023_03", "QTD_ORD_CA_2023_04", "QTD_ORD_CA_2023_05", "QTD_ORD_CA_2023_06", 
"QTD_ORD_CA_2023_07", "QTD_ORD_CA_2023_08", "QTD_ORD_CA_2023_09", "QTD_ORD_CA_2023_10", "QTD_ORD_CA_2023_11", "QTD_ORD_CA_2023_12", "QTD_ORD_CA_2024_01", "QTD_ORD_CA_2024_02", "QTD_ORD_CA_2024_03", "QTD_ORD_CA_2024_04", "QTD_ORD_CA_2024_05", "QTD_ORD_CA_2024_06", "QTD_ORD_CA_2024_07", "QTD_ORD_CA_2024_08", "QTD_ORD_CA_2024_09", "QTD_ORD_CA_2024_10", "QTD_ORD_CA_2024_11", "QTD_ORD_CA_2024_12", "QTD_ORD_GAL_2023_01", "QTD_ORD_GAL_2023_02", "QTD_ORD_GAL_2023_03", "QTD_ORD_GAL_2023_04", "QTD_ORD_GAL_2023_05", "QTD_ORD_GAL_2023_06", "QTD_ORD_GAL_2023_07", "QTD_ORD_GAL_2023_08", "QTD_ORD_GAL_2023_09", "QTD_ORD_GAL_2023_10", "QTD_ORD_GAL_2023_11", "QTD_ORD_GAL_2023_12", "QTD_ORD_GAL_2024_01", "QTD_ORD_GAL_2024_02", "QTD_ORD_GAL_2024_03", "QTD_ORD_GAL_2024_04", "QTD_ORD_GAL_2024_05", "QTD_ORD_GAL_2024_06", "QTD_ORD_GAL_2024_07", "QTD_ORD_GAL_2024_08", "QTD_ORD_GAL_2024_09", "QTD_ORD_GAL_2024_10", "QTD_ORD_GAL_2024_11", "QTD_ORD_GAL_2024_12", "QTD_RET_CA_2023_01", "QTD_RET_CA_2023_02", "QTD_RET_CA_2023_03", "QTD_RET_CA_2023_04", "QTD_RET_CA_2023_05", "QTD_RET_CA_2023_06", "QTD_RET_CA_2023_07", "QTD_RET_CA_2023_08", "QTD_RET_CA_2023_09", "QTD_RET_CA_2023_10", "QTD_RET_CA_2023_11", "QTD_RET_CA_2023_12", "QTD_RET_CA_2024_01", "QTD_RET_CA_2024_02", "QTD_RET_CA_2024_03", "QTD_RET_CA_2024_04", "QTD_RET_CA_2024_05", "QTD_RET_CA_2024_06", "QTD_RET_CA_2024_07", "QTD_RET_CA_2024_08", "QTD_RET_CA_2024_09", "QTD_RET_CA_2024_10", "QTD_RET_CA_2024_11", "QTD_RET_CA_2024_12", "QTD_RET_GAL_2023_01", "QTD_RET_GAL_2023_02", "QTD_RET_GAL_2023_03", "QTD_RET_GAL_2023_04", "QTD_RET_GAL_2023_05", "QTD_RET_GAL_2023_06", "QTD_RET_GAL_2023_07", "QTD_RET_GAL_2023_08", "QTD_RET_GAL_2023_09", "QTD_RET_GAL_2023_10", "QTD_RET_GAL_2023_11", "QTD_RET_GAL_2023_12", "QTD_RET_GAL_2024_01", "QTD_RET_GAL_2024_02", "QTD_RET_GAL_2024_03", "QTD_RET_GAL_2024_04", "QTD_RET_GAL_2024_05", "QTD_RET_GAL_2024_06", "QTD_RET_GAL_2024_07", "QTD_RET_GAL_2024_08", "QTD_RET_GAL_2024_09", "QTD_RET_GAL_2024_10", 
    "QTD_RET_GAL_2024_11", "QTD_RET_GAL_2024_12"))

# List all variables in the environment
all_vars <- ls()

# Keep 'full_data', 'full_data_customer', and other objects still needed
vars_to_keep <- c("full_data", "full_data_customer", "mydir", "one_seed", "reference_date")

# Get the variables to remove
vars_to_remove <- setdiff(all_vars, vars_to_keep)

# Remove the temporary objects
rm(list = vars_to_remove)

# Clean up by removing 'all_vars' and 'vars_to_remove'
rm(all_vars, vars_to_remove)
```

## 7. Customer Segmentation

Since all customers in the original dataset are served by red trucks, there is no prior information on the characteristics of those who would be served by white trucks. The only available reference is the average annual consumption threshold of 400 gallons or cases.

To address this, customers were segmented based on their most relevant characteristics within the available scope, including variables created during the analysis. The selected variables represent store-level traits or consumption behavior; geographic and census data were excluded.

The selected variables are listed below:

**Customer Type & Relationship**: These variables represent customers' relationship with the company and their type:

- LOCAL_FOUNT_ONLY: Customers who only consume fountain drinks.
- LOCAL_MARKET_PARTNER: Local market partners.
- CO2_CUSTOMER: Customers who are CO2 consumers.
- CHAIN_MEMBER: Customers who are part of a chain.

**Time-Related Metrics**: Time-related metrics track customers' activity and engagement over time:

- DAYS_ONBOARDING: Number of days since onboarding.
- DAYS_FIRST_DLV: Number of days since the first delivery.
- DAYS_AF_LAST_ORD: Number of days since the last order.
- AVG_DAYS_BET_ORD: Average number of days between orders.

**Order & Sales Behavior**: These variables represent customer behaviors in terms of orders and sales:

- NUM_ORDERS: Total number of orders.
- TOTAL_ORDERED: Total volume of orders.
- RFM_SCORE: Adapted Recency, Frequency, Monetary score.
- TOTAL_COST_CA_GAL: Total delivery cost for 2023 and 2024.

**Order Channels**: This category contains data on the channels through which customers place their orders:

- OT_CALL.CENTER: Transactions via call center.
- OT_OTHER: Transactions made through other means (emails, trade fairs, etc.).
- OT_SALES.REP: Transactions via sales representatives.
- OT_MYCOKE: Transactions via MyCoke (legacy platform).
- OT_EDI: Transactions via electronic direct ordering (EDI).

**Growth & Demand Potential**: These flags indicate customers' growth and demand potential:

- HIGH_GROW_POT: Flag for customers with above-average growth potential across all segments.
- CHANNEL_GROWTH_POT: Flag for customers with above-average growth within their segment.
- LOW_DEMAND_CUST: Flag for customers with low demand (below the 1st quartile) in their segment.

Three variables, **NUM_ORDERS**, **TOTAL_ORDERED**, and **TOTAL_COST_CA_GAL**, have a wide range of values with extreme outliers, so a logarithmic transformation will be applied to them.

```{r, warning=FALSE}
# Select the primary variables for clustering based on business relevance
selected_vars <- c("LOCAL_FOUNT_ONLY", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "CHAIN_MEMBER",
                   "DAYS_ONBOARDING", "DAYS_FIRST_DLV", "DAYS_AF_LAST_ORD", "AVG_DAYS_BET_ORD",
                   "OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP", "OT_MYCOKE.LEGACY", "OT_MYCOKE360",
                   "OT_EDI", "NUM_ORDERS", "TOTAL_ORDERED", "RFM_SCORE", "HIGH_GROW_POT",
                   "CHANNEL_GROWTH_POT", "LOW_DEMAND_CUST", "TOTAL_COST_CA_GAL")

# Extract the data, keeping only numeric variables
data_to_cluster <- full_data_customer %>%
  dplyr::select(all_of(selected_vars)) %>%
  dplyr::select(where(is.numeric))

# Apply log transformation to the skewed variables
data_to_cluster$DAYS_ONBOARDING <- log1p(data_to_cluster$DAYS_ONBOARDING)
data_to_cluster$DAYS_FIRST_DLV <- log1p(data_to_cluster$DAYS_FIRST_DLV)
data_to_cluster$TOTAL_ORDERED <- log1p(data_to_cluster$TOTAL_ORDERED)
data_to_cluster$TOTAL_COST_CA_GAL <- log1p(data_to_cluster$TOTAL_COST_CA_GAL)

# Standardize the numeric variables for clustering
data_to_cluster <- scale(data_to_cluster)

# Determine the optimal number of clusters using the Elbow Method
set.seed(500)
wss <- sapply(1:10, function(k) kmeans(data_to_cluster, centers = k, nstart = 25)$tot.withinss)

# Visualize the Elbow Method results
plot(1:10, wss, type = "b", pch = 19, frame = FALSE,
     xlab = "Number of Clusters", ylab = "Total Within Sum of Squares (WSS)",
     main = "Elbow Method for Optimal K")
```

After testing different configurations for the silhouette and ARI scores, varying the number of clusters from 2 to 4, using multiple distance metrics (Euclidean, Manhattan), and applying different algorithms (Hartigan-Wong, MacQueen, Lloyd), the most relevant metrics are presented below.

```{r, warning=FALSE}
# Set seed for reproducibility
set.seed(500)

# Function to calculate the Silhouette Score
calculate_silhouette_score <- function(model, data) {
  clusters <- model$cluster
  if (length(clusters) != nrow(data)) {
    stop("Cluster assignments do not match the number of data points.")
  }
  dist_matrix <- dist(data)
  silhouette_score <- silhouette(clusters, dist_matrix)
  return(mean(silhouette_score[, 3]))
}

# Function to calculate the Adjusted Rand Index
calculate_ari <- function(model, true_labels) {
  clusters <- model$cluster
  ari_score <- adjustedRandIndex(clusters, true_labels)
  return(ari_score)
}

# Select the primary variables for clustering based on business relevance
selected_vars <- c("LOCAL_FOUNT_ONLY", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "CHAIN_MEMBER",
                   "DAYS_ONBOARDING", "DAYS_FIRST_DLV", "DAYS_AF_LAST_ORD", "AVG_DAYS_BET_ORD",
                   "OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP", "OT_MYCOKE.LEGACY", "OT_MYCOKE360",
                   "OT_EDI", "NUM_ORDERS", "TOTAL_ORDERED", "RFM_SCORE", "HIGH_GROW_POT",
                   "CHANNEL_GROWTH_POT", "LOW_DEMAND_CUST", "TOTAL_COST_CA_GAL")

# Extract the data, keeping only numeric variables
data_to_cluster <- full_data_customer %>%
  dplyr::select(all_of(selected_vars)) %>%
  dplyr::select(where(is.numeric))

# Apply log transformation to the skewed variables
data_to_cluster$DAYS_ONBOARDING <- log1p(data_to_cluster$DAYS_ONBOARDING)
data_to_cluster$DAYS_FIRST_DLV <- log1p(data_to_cluster$DAYS_FIRST_DLV)
data_to_cluster$TOTAL_ORDERED <- log1p(data_to_cluster$TOTAL_ORDERED)
data_to_cluster$TOTAL_COST_CA_GAL <- log1p(data_to_cluster$TOTAL_COST_CA_GAL)

# Standardize the numeric variables for clustering
data_to_cluster <- scale(data_to_cluster)

# Define different parameter configurations for K-means
params <- list(
  list(name = "Euclidean, 2 Clusters", centers = 2, nstart = 25, algorithm = "Hartigan-Wong"),
  list(name = "Euclidean, 3 Clusters", centers = 3, nstart = 25, algorithm = "Hartigan-Wong"),
  list(name = "Euclidean, 4 Clusters", centers = 4, nstart = 25, algorithm = "Hartigan-Wong")
  # , list(name = "Manhattan, 2 Clusters", centers = 2, nstart = 25, algorithm = "MacQueen"),
  # list(name = "Manhattan, 3 Clusters", centers = 3, nstart = 25, algorithm = "MacQueen"),
  # list(name = "Manhattan, 4 Clusters", centers = 4, nstart = 25, algorithm = "MacQueen"),
  # list(name = "K-means++, 2 Clusters", centers = 2, nstart = 25, algorithm = "Lloyd"),
  # list(name = "K-means++, 3 Clusters", centers = 3, nstart = 25, algorithm = "Lloyd"),
  # list(name = "K-means++, 4 Clusters", centers = 4, nstart = 25, algorithm = "Lloyd")
)

# Apply K-means clustering and store the results
results <- lapply(params, function(param) {
  model <- kmeans(data_to_cluster, centers = param$centers, nstart = param$nstart, algorithm = param$algorithm)
  silhouette <- calculate_silhouette_score(model, data_to_cluster)
  ari <- calculate_ari(model, full_data_customer$LOCAL_MARKET_PARTNER)  # Replace with a true label column if available
  return(data.frame(Model = param$name,
                    Silhouette_Score = round(silhouette, 3),
                    ARI = round(ari, 3)))
})

# Combine results into a single table
results_df <- do.call(rbind, results)

# Display the table using kable
kable(results_df, col.names = c("Parameter", "Silhouette Score", "Adjusted Rand Index (ARI)"))
```

Given the results, the combination "Euclidean, 3 Clusters" was selected (`centers = 3`, `nstart = 25`, and the default "Hartigan-Wong" algorithm), as it performed best among the tested options. Even so, the separation between clusters remains marginal and relatively weak.

Below is the visualization of the clusters based on the first two principal components.

```{r}
# Run K-means with the chosen number of clusters
set.seed(500)
kmeans_optimal <- kmeans(data_to_cluster, centers = 3, nstart = 25, algorithm = "Hartigan-Wong")

# Add cluster assignments to the original dataset
full_data_customer$CLUSTER <- kmeans_optimal$cluster

# Define custom colors for the clusters
palette_clusters <- c("1" = "#FF6347",  # Coral
                      "2" = "#4682B4",  # Steel blue
                      "3" = "#FFD700")  # Gold

# Visualize the cluster distribution in PCA-reduced dimensions
fviz_cluster(kmeans_optimal, data = data_to_cluster, geom = "point", ellipse.type = "none",
             main = "Customer Segmentation: PCA-based Visualization",
             subtitle = "K-means Optimal Clustering with 3 Segments",
             ggtheme = theme_minimal()) +
  scale_color_manual(values = palette_clusters)  # Manually set colors
```

The customer segmentation will be discussed later, including the interpretation of each cluster.

### 7.1 Clusters and Principal Components

Given the visualization of the clusters through their principal components, we further explored the characteristics of the two main components, as they account for 39% of the total variability.

```{r}
# Select the desired variables, matching the clustering selection
selected_vars <- c("LOCAL_FOUNT_ONLY", "LOCAL_MARKET_PARTNER", "CO2_CUSTOMER", "CHAIN_MEMBER",
                   "DAYS_ONBOARDING", "DAYS_FIRST_DLV", "DAYS_AF_LAST_ORD", "AVG_DAYS_BET_ORD",
                   "OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP", "OT_MYCOKE.LEGACY", "OT_MYCOKE360",
                   "OT_EDI", "NUM_ORDERS", "TOTAL_ORDERED",
"RFM_SCORE", "HIGH_GROW_POT", "CHANNEL_GROWTH_POT", "LOW_DEMAND_CUST", "TOTAL_COST_CA_GAL")# Select only the desired variables from full_data_customer - works for both data.frame and data.tablecustomer_data <- full_data_customer[, selected_vars]# Remove rows with NA values (if any)customer_data <-na.omit(customer_data)# Apply log transformation to the same variables as in clusteringcustomer_data$DAYS_ONBOARDING <-log1p(customer_data$DAYS_ONBOARDING)customer_data$DAYS_FIRST_DLV <-log1p(customer_data$DAYS_FIRST_DLV)customer_data$TOTAL_ORDERED <-log1p(customer_data$TOTAL_ORDERED)customer_data$TOTAL_COST_CA_GAL <-log1p(customer_data$TOTAL_COST_CA_GAL)# Build scales for the datasetscales <-build_scales(customer_data, verbose =FALSE)# Scaling columnscustomer_data <-fast_scale(customer_data, scales = scales, verbose =FALSE)# Calculating the covariance matrixcov_customer <-cov(customer_data)# Calculating the Eigenvector and Eigenvalues of the variance-covariance matrixe_customer <-eigen(cov_customer)eigenvalues_customer <- e_customer$valueseigenvectors_customer <- e_customer$vectors# Print#print(paste("Counts the number of eigenvalues:", length(eigenvalues_customer)))# Initialize an empty matrix to store the contributions of variables to all PCscontributions_matrix <-matrix(NA, nrow =ncol(customer_data), ncol =ncol(eigenvectors_customer))# Loop through all principal componentsfor (i in1:ncol(eigenvectors_customer)) {# Get the contributions of variables to the i-th principal component (PC) pc_contributions <- eigenvectors_customer[, i]# Assign the contributions to the corresponding column in the matrix contributions_matrix[, i] <-round(pc_contributions, 2) # Round to 2 decimal places}# Convert the matrix to a data frame and assign appropriate row and column namescontributions_df <-as.data.frame(contributions_matrix)colnames(contributions_df) <-paste0("PC", 1:ncol(contributions_matrix)) # Name the columns dynamicallyrownames(contributions_df) <-colnames(customer_data) # 
Assign the variable names as row names# Variance Explained# Calculate the variance explained by each principal componentvariance_explained <- eigenvalues_customer /sum(eigenvalues_customer)# Round the variance explained to 2 decimal placesvariance_row <-round(variance_explained, 2)# Calculate the cumulative variance explainedcumulative_variance <-cumsum(variance_explained)# Round the cumulative variance to 2 decimal placescumulative_variance_row <-round(cumulative_variance, 2)# Add the variance and cumulative variance rows to the bottom of the data framecontributions_df <-rbind(contributions_df, Variance_Explained = variance_row,Cumulative_Variance = cumulative_variance_row)# Format the table using formattable for heatmap effectformattable(contributions_df, list(# Apply color gradient to all columnsarea(col =1:ncol(contributions_df)) ~color_tile("white", "deepskyblue3") ))```Principal Component 1 has the highest weight from the variables RFM_SCORE, NUM_ORDERS, TOTAL_ORDERED, and TOTAL_COST_CA_GAL, representing 30% of the variance. 
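The manual eigen-decomposition used above can be cross-checked against R's built-in `prcomp()`, which performs the same computation; a minimal sketch on a small synthetic matrix (illustrative data, not the customer dataset):

```r
# Synthetic data: 100 rows, 4 numeric columns, with induced correlation
set.seed(500)
X <- matrix(rnorm(400), ncol = 4)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.3)
X <- scale(X)

# Manual route: eigen-decomposition of the covariance matrix
e <- eigen(cov(X))
manual_var_explained <- e$values / sum(e$values)

# Built-in route: prcomp on the same (already scaled) data
p <- prcomp(X, center = FALSE, scale. = FALSE)
prcomp_var_explained <- p$sdev^2 / sum(p$sdev^2)

# Both routes give the same explained-variance profile
all.equal(manual_var_explained, prcomp_var_explained)
```

`prcomp()` also returns the loadings directly in `p$rotation`, which correspond (up to sign) to the eigenvector columns computed manually.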
Principal Component 2 adds another 9% of the variance, with the highest weights from the OT_MYCOKE variables.

### 7.2 Cluster Features

The clusters will be characterized based on their relationships with other variables.

```{r, warning=FALSE, message=FALSE}
# Define specific colors for fleet types
fleet_colors <- c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")

# Create a cross-tabulation of CLUSTER and FLEET_TYPE
cluster_fleet_table <- table(full_data_customer$CLUSTER, full_data_customer$FLEET_TYPE, useNA = "ifany")

# Create a data frame for visualization
cluster_fleet_df <- as.data.frame.table(cluster_fleet_table)
names(cluster_fleet_df) <- c("Segment", "FleetType", "Count")

# Filter out NA values for a cleaner visualization
cluster_fleet_df <- cluster_fleet_df %>% filter(!is.na(Segment) & !is.na(FleetType))

# Calculate proportions
cluster_fleet_df$Pct <- cluster_fleet_df$Count / ave(cluster_fleet_df$Count, cluster_fleet_df$Segment, FUN = sum)

# Percentage distribution of fleet types within clusters
ggplot(cluster_fleet_df, aes(x = Segment, y = Pct, fill = FleetType)) +
  geom_bar(stat = "identity", position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = fleet_colors) +
  # Percentage labels inside the bars
  geom_text(aes(label = scales::percent(Pct, accuracy = 0.1)),
            position = position_fill(vjust = 0.5), color = "black", size = 3.5) +
  labs(title = "Fleet Type Distribution Across Clusters",
       subtitle = "Fleet type classification using a 400-gallon threshold",
       x = "Cluster", y = "Percentage", fill = "Fleet Type") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12, color = "gray30"),
        legend.position = "right",
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())

# Define specific colors for the growth and fountain flags
high_growth_colors <- c("1" = "#FF6347", "0" = "#D3D3D3")    # High growth vs. not
fountain_only_colors <- c("1" = "#4682B4", "0" = "#D3D3D3")  # Fountain only vs. not

# Create a data frame for the HIGH_GROW_POT visualization
cluster_high_growth_df <- as.data.frame.table(table(full_data_customer$CLUSTER, full_data_customer$HIGH_GROW_POT))
names(cluster_high_growth_df) <- c("Segment", "HighGrowth", "Count")

# Filter out NA values and calculate proportions
cluster_high_growth_df <- cluster_high_growth_df %>% filter(!is.na(Segment) & !is.na(HighGrowth))
cluster_high_growth_df$Pct <- cluster_high_growth_df$Count / ave(cluster_high_growth_df$Count, cluster_high_growth_df$Segment, FUN = sum)

# Plot the HIGH_GROW_POT distribution by cluster
ggplot(cluster_high_growth_df, aes(x = Segment, y = Pct, fill = HighGrowth)) +
  geom_bar(stat = "identity", position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = high_growth_colors) +
  geom_text(aes(label = scales::percent(Pct, accuracy = 0.1)),
            position = position_fill(vjust = 0.5), color = "black", size = 3.5) +
  labs(title = "Clusters by Growth Potential",
       subtitle = "Proportional Representation by High Growth Potential",
       x = "Cluster", y = "Percentage", fill = "Growth Potential") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12, color = "gray30"),
        legend.position = "right",
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())

# Create a data frame for the LOCAL_FOUNT_ONLY visualization
cluster_fountain_df <- as.data.frame.table(table(full_data_customer$CLUSTER, full_data_customer$LOCAL_FOUNT_ONLY))
names(cluster_fountain_df) <- c("Segment", "FountainOnly", "Count")

# Filter out NA values and calculate proportions
cluster_fountain_df <- cluster_fountain_df %>% filter(!is.na(Segment) & !is.na(FountainOnly))
cluster_fountain_df$Pct <- cluster_fountain_df$Count / ave(cluster_fountain_df$Count, cluster_fountain_df$Segment, FUN = sum)

# Plot the LOCAL_FOUNT_ONLY distribution by cluster
ggplot(cluster_fountain_df, aes(x = Segment, y = Pct, fill = FountainOnly)) +
  geom_bar(stat = "identity", position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = fountain_only_colors) +
  geom_text(aes(label = scales::percent(Pct, accuracy = 0.1)),
            position = position_fill(vjust = 0.5), color = "black", size = 3.5) +
  labs(title = "Clusters by Fountain Only",
       subtitle = "Proportional Representation by Fountain Only",
       x = "Cluster", y = "Percentage", fill = "Fountain Only") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12, color = "gray30"),
        legend.position = "right",
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())

# Define specific colors for the LOW_DEMAND_CUST variable
low_demand_colors <- c("1" = "yellow", "0" = "#D3D3D3")  # Low demand vs. not

# Create a data frame for the LOW_DEMAND_CUST visualization
cluster_low_demand_df <- as.data.frame.table(table(full_data_customer$CLUSTER, full_data_customer$LOW_DEMAND_CUST))
names(cluster_low_demand_df) <- c("Segment", "LowDemand", "Count")

# Filter out NA values and calculate proportions
cluster_low_demand_df <- cluster_low_demand_df %>% filter(!is.na(Segment) & !is.na(LowDemand))
cluster_low_demand_df$Pct <- cluster_low_demand_df$Count / ave(cluster_low_demand_df$Count, cluster_low_demand_df$Segment, FUN = sum)

# Plot the LOW_DEMAND_CUST distribution by cluster
ggplot(cluster_low_demand_df, aes(x = Segment, y = Pct, fill = LowDemand)) +
  geom_bar(stat = "identity", position = "fill", width = 0.7) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = low_demand_colors) +
  geom_text(aes(label = scales::percent(Pct, accuracy = 0.1)),
            position = position_fill(vjust = 0.5), color = "black", size = 3.5) +
  labs(title = "Clusters by Low Demand Customers",
       subtitle = "Proportional Representation by Low Demand Customers",
       x = "Cluster", y = "Percentage", fill = "Low Demand Customers") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12, color = "gray30"),
        legend.position = "right",
        panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())

# Create a log-transformed version of TOTAL_ORDERED (log1p handles zero values)
full_data_customer$LOG_TOTAL_ORDERED <- log1p(full_data_customer$TOTAL_ORDERED)

# Reshape data for faceting using tidyr::pivot_longer()
plot_data <- full_data_customer %>%
  pivot_longer(cols = c(RFM_SCORE, NUM_ORDERS, LOG_TOTAL_ORDERED),
               names_to = "variable", values_to = "value")

# Rename variable levels for better readability
plot_data$variable <- case_when(
  plot_data$variable == "LOG_TOTAL_ORDERED" ~ "TOTAL_ORDERED (Log Scale)",
  TRUE ~ plot_data$variable
)

# Boxplots with facet_wrap
ggplot(plot_data, aes(x = factor(CLUSTER), y = value, fill = factor(CLUSTER))) +
  geom_boxplot(color = "black", alpha = 0.7) +      # Black borders for contrast
  scale_fill_manual(values = palette_clusters) +    # Custom cluster colors
  facet_wrap(~ variable, scales = "free_y") +       # Allow different y-scales per variable
  labs(title = "Customer Segmentation Characterization",
       subtitle = "Distribution of RFM Score, Number of Orders, and Log-Transformed Total Ordered for each cluster",
       x = "Cluster", y = "Value") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 11, color = "gray30"),
        axis.title = element_text(face = "bold"),
        panel.grid.major = element_line(color = "gray90"),
        panel.grid.minor = element_blank(),
        legend.position = "none")  # Clusters are already labeled on the x-axis
```

```{r, warning=FALSE}
# Prepare the dataset: long format with log-transformed order counts
plot_data_filtered <- full_data_customer %>%
  pivot_longer(cols = c("OT_CALL.CENTER", "OT_OTHER", "OT_SALES.REP",
                        "OT_MYCOKE.LEGACY", "OT_MYCOKE360", "OT_EDI"),
               names_to = "variable", values_to = "value") %>%
  mutate(value = log1p(value))  # log1p safely handles zero values

# Define custom labels for the variables
custom_labels <- c("OT_CALL.CENTER" = "Call Center",
                   "OT_OTHER" = "Other",
                   "OT_SALES.REP" = "Sales Rep",
                   "OT_MYCOKE.LEGACY" = "MyCoke Legacy",
                   "OT_MYCOKE360" = "MyCoke360",
                   "OT_EDI" = "EDI")

# Generate the boxplots
ggplot(plot_data_filtered, aes(x = factor(CLUSTER), y = value, fill = factor(CLUSTER))) +
  geom_boxplot(color = "black", alpha = 0.7) +
  scale_fill_manual(values = palette_clusters) +
  facet_wrap(~ variable, scales = "fixed", labeller = labeller(variable = as_labeller(custom_labels))) +
  scale_y_continuous(limits = c(0, 6), breaks = seq(0, 6, 1)) +  # Fixed y-axis scale
  labs(title = "Customer Segmentation Characterization",
       subtitle = "Distribution of orders by Order Type (Log Scale) for each cluster",
       x = "Cluster", y = "Log(Value)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 11, color = "gray30"),
        axis.title = element_text(face = "bold"),
        panel.grid.major = element_line(color = "gray90"),
        panel.grid.minor = element_blank(),
        legend.position = "none")
```

**Cluster 1 (Red): High Demand Customers**

- Composition: Approximately 80% of customers receive deliveries via red trucks (based on the benchmark threshold of 400 gallons on average per year).
- Growth: Around 7% of customers exhibit high growth potential.
- Local Fountain Only: Only 1.5% of customers are local fountain-only.
- Average RFM: The average RFM score for this cluster is 29, the highest among the three clusters.
- Average Number of Orders: The average number of orders per customer was 81 across 2023 and 2024, with many outliers showing significantly higher order volumes.
- Total Ordered Volume: The average total ordered volume per customer in 2023 and 2024 is 4,638 gallons. This cluster has the highest number of outliers with elevated volumes, which skews the average; the median volume is 1,707 gallons.
- Volume Share: This cluster represents 76% of the total volume consumed in 2023 and 2024.
- It has the highest number of orders through digital channels and is the cluster most served by sales representatives.

**Cluster 2 (Blue): Intermediate Customers with Growth Potential**

- Composition: Approximately 87% of customers receive deliveries via white trucks (based on the benchmark threshold of 400 gallons on average per year).
- Growth: This cluster has the highest percentage of customers with high growth potential, at 16.6%.
- Local Fountain Only: Around 4.2% of customers are local fountain-only.
- Average RFM: The average RFM score for this group is 18.7, the second highest among the clusters.
- Average Number of Orders: The average number of orders per customer was 30 across 2023 and 2024.
- Total Ordered Volume: The average total ordered volume per customer in 2023 and 2024 is 525 gallons; the median volume is 331 gallons.
- Volume Share: This cluster represents approximately 22% of the total volume consumed in 2023 and 2024.
- It is the cluster with the highest average number of orders placed via the call center. It has fewer orders through digital channels than Cluster 1, but more than Cluster 3. The number of orders through sales representatives is similar to Cluster 1.

**Cluster 3 (Yellow): Less Active Customers with Low Order Volume**

- Composition: Only 0.4% of customers receive deliveries via red trucks (based on the benchmark threshold of 400 gallons on average per year).
- Growth: Approximately 6% of customers exhibit high growth potential.
- Local Fountain Only: This cluster has the highest percentage of local fountain-only customers, at 7.5%.
- Average RFM: The average RFM score is 7, indicating these are the least active customers.
- Average Number of Orders: The average number of orders per customer was 5.5 across 2023 and 2024.
- Total Ordered Volume: The average total ordered volume per customer in 2023 and 2024 is around 80 gallons, while the median is 57 gallons, indicating a large number of customers with smaller volumes.
- Volume Share: This cluster represents only 1.7% of the total volume consumed in 2023 and 2024.
- Orders are spread across call centers, digital channels, and sales representatives, although in smaller absolute quantities than in the other clusters.

## 8. Classification Models for Explaining Clusters

To better understand the variables influencing cluster composition and to facilitate future predictions without the need for re-clustering, two classification models will be used: decision trees and multinomial logistic regression.
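The workflow these models enable — fit once on the clustered data, then score new customers without re-running k-means — can be sketched as follows (the toy data and column names here are illustrative, not the project's):

```r
library(rpart)

# Toy "clustered" training data: two features and an existing cluster label,
# with deliberately well-separated groups
set.seed(500)
train <- data.frame(
  NUM_ORDERS    = c(rpois(50, 80), rpois(50, 30), rpois(50, 5)),
  TOTAL_ORDERED = c(rnorm(50, 4600, 500), rnorm(50, 520, 80), rnorm(50, 80, 20)),
  CLUSTER       = factor(rep(1:3, each = 50))
)

# Fit a classification tree that explains the cluster labels
tree_fit <- rpart(CLUSTER ~ ., data = train, method = "class")

# New customers can now be assigned a cluster directly, with no re-clustering
new_customers <- data.frame(NUM_ORDERS = c(75, 4), TOTAL_ORDERED = c(4200, 90))
predict(tree_fit, new_customers, type = "class")
```

The same pattern applies to the multinomial model: refitting is only needed when the segmentation itself is revised, not when new customers arrive.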
These models will help identify the key characteristics that drive cluster formation. By applying them to new data, cluster assignments can be predicted, streamlining the analysis and eliminating the need to recreate the clusters whenever new data is introduced.

### 8.1 Decision Tree

The selected variables will be analyzed to explain the clusters using a decision tree, with the dataset split into training and test sets and 20-fold cross-validation applied.

```{r}
# Prepare data for the decision tree
model_data <- full_data_customer %>%
  dplyr::select(all_of(selected_vars), CLUSTER) %>%
  mutate(CLUSTER = as.factor(CLUSTER))

# Create train/test split (70% train, 30% test)
set.seed(500)
train_indices <- createDataPartition(model_data$CLUSTER, p = 0.7, list = FALSE)
train_data <- model_data[train_indices, ]
test_data <- model_data[-train_indices, ]

# Set up 20-fold cross-validation
train_control <- trainControl(method = "cv", number = 20)

# Train the decision tree model with cross-validation
decision_tree_model <- train(CLUSTER ~ .,
                             data = train_data,
                             method = "rpart",
                             trControl = train_control,
                             tuneLength = 5)

# Plot the decision tree
rpart.plot(decision_tree_model$finalModel,
           extra = 101, box.palette = "Blues", shadow.col = "gray", nn = TRUE,
           main = "Decision Tree: Explaining Customer Clusters",
           branch.col = "gray", faclen = 0, tweak = 1.1)
```

Below are the prediction performance metrics:

```{r}
# Evaluate model performance on the test set
dt_test_predictions <- predict(decision_tree_model, test_data, type = "raw")
dt_test_confusion_matrix <- confusionMatrix(dt_test_predictions, test_data$CLUSTER)

# Calculate accuracy on the test set
dt_test_accuracy <- round(mean(dt_test_predictions == test_data$CLUSTER), 2)

# Evaluate model performance on the train set
dt_train_predictions <- predict(decision_tree_model, train_data, type = "raw")
dt_train_accuracy <- round(mean(dt_train_predictions == train_data$CLUSTER), 2)

# Print model performance metrics
cat("\n--- Decision Tree Model Performance ---\n")
print(dt_test_confusion_matrix)
```

```{r}
# Accuracy on train and test sets for the decision tree
dt_train_acc <- round(mean(dt_train_predictions == train_data$CLUSTER), 2)
dt_test_acc <- round(mean(dt_test_predictions == test_data$CLUSTER), 2)

# Create the comparison data frame
dt_acc_comp <- data.frame(Set = c("Train", "Test"),
                          Accuracy = c(dt_train_acc, dt_test_acc))

# Display the formatted table with kable
dt_acc_comp %>%
  kable(caption = "Decision Tree Accuracy Comparison (Train vs Test)",
        col.names = c("Dataset", "Accuracy")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

The model has an accuracy of 91% on both the train and test sets, indicating good generalization. In Cluster 1 (High Demand Customers), precision is 90.2% and recall is 82.5%. For Cluster 2 (Intermediate Customers with Growth Potential), precision is 91.5% and recall is 90.8%. For Cluster 3 (Less Active Customers with Low Order Volume), precision is 89.6% and recall is 97.0%. Overall, the model performs well across all clusters, with strong precision and recall for Clusters 1 and 3 and solid performance for Cluster 2.

```{r}
# Extract variable importance from the trained decision tree model
var_importance <- decision_tree_model$finalModel$variable.importance
dt_var_importance_df <- data.frame(Variable = names(var_importance),
                                   Importance = var_importance)

# Normalize importance values
dt_var_importance_df <- dt_var_importance_df %>%
  mutate(Importance = Importance / max(Importance))

# Sort by importance and keep the top 10 variables
dt_var_importance_df <- dt_var_importance_df %>%
  arrange(desc(Importance)) %>%
  head(10)

# Plot the top 10 most important variables
ggplot(dt_var_importance_df, aes(x = reorder(Variable, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "seagreen") +
  coord_flip() +
  labs(title = "Top 10 Variables Explaining Customer Clusters",
       subtitle = "Decision Tree Variable Importance",
       x = NULL,
       y = "Relative Importance (Normalized)") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12, color = "gray30"),
        axis.title = element_text(face = "bold"),
        panel.grid.major.y = element_blank())
```

The most important variables in the model were the number of orders per customer, the average days between orders, the RFM score, the total ordered volume, the total cost, and the low-demand customer flag.

### 8.2 Multinomial Logistic Regression

The influence of the selected variables on customer clusters will be explored using multinomial logistic regression to predict the probabilities of new customers belonging to each of the established clusters.
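The probability output this approach provides can be illustrated with `nnet::multinom` on a toy data set (the predictor name is illustrative, not the project's fitted model):

```r
library(nnet)

# Toy data: one predictor separating three well-spaced groups
set.seed(500)
toy <- data.frame(
  AVG_DAYS_BET_ORD = c(rnorm(50, 5), rnorm(50, 20), rnorm(50, 60)),
  CLUSTER = factor(rep(1:3, each = 50))
)

# Fit a multinomial logistic regression
fit <- multinom(CLUSTER ~ AVG_DAYS_BET_ORD, data = toy, trace = FALSE)

# For a new customer, predict a probability for each cluster (probabilities sum to 1)
probs <- predict(fit, data.frame(AVG_DAYS_BET_ORD = 18), type = "probs")
round(probs, 3)
```

Unlike a hard class label, these per-cluster probabilities also convey how confident the assignment is, which is useful for borderline customers.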
Multinomial logistic regression is well-suited for modeling the relationship between the predictors and the probability of a customer being assigned to each of the three clusters, helping to assess the likelihood of membership in each group based on customer characteristics. Variable standardization and Elastic Net regularization will be used during model development.

```{r, warning=FALSE}
# Normalize predictors
preprocess_params <- preProcess(model_data, method = c("center", "scale"))
model_data <- predict(preprocess_params, model_data)

# Create train/test split (70% train, 30% test)
set.seed(500)
train_indices <- createDataPartition(model_data$CLUSTER, p = 0.7, list = FALSE)
train_data <- model_data[train_indices, ]
test_data <- model_data[-train_indices, ]

# Set up 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)

# Define a smaller tuning grid for efficiency
tune_grid <- expand.grid(alpha = 0.5, lambda = seq(0.1, 1, length = 5))

# Train the model with Elastic Net regularization
mlogistic_model <- train(CLUSTER ~ .,
                         data = train_data,
                         method = "glmnet",
                         trControl = train_control,
                         tuneGrid = tune_grid,
                         control = list(maxit = 200000))

# Print the trained model summary
print(mlogistic_model)

# Make predictions on the test set
mlogis_predictions <- predict(mlogistic_model, test_data)

# Evaluate model performance
mlogistic_model_performance <- postResample(pred = mlogis_predictions, obs = test_data$CLUSTER)
print(mlogistic_model_performance)

# Display the confusion matrix
mlogistic_confusion_matrix <- confusionMatrix(mlogis_predictions, test_data$CLUSTER)
print(mlogistic_confusion_matrix)
```

```{r}
# Generate predictions on the train set
train_predictions <- predict(mlogistic_model, train_data)

# Evaluate accuracy on train and test sets
mlogistic_train_acc <- round(postResample(pred = train_predictions, obs = train_data$CLUSTER)["Accuracy"], 2)
mlogistic_test_acc <- round(postResample(pred = mlogis_predictions, obs = test_data$CLUSTER)["Accuracy"], 2)

# Create the comparison data frame
mlogistic_acc_comp <- data.frame(Set = c("Train", "Test"),
                                 Accuracy = c(mlogistic_train_acc, mlogistic_test_acc))

# Display the formatted table
mlogistic_acc_comp %>%
  kable(caption = "Multinomial Logistic Regression Accuracy Comparison (Train vs Test)",
        col.names = c("Dataset", "Accuracy")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
```

The model achieved an accuracy of 89.3% on the test set, reflecting strong performance. In Cluster 1 (Red): High Demand Customers, recall is 70.8% and precision is 99.8%. For Cluster 2 (Blue): Intermediate Customers with Growth Potential, recall is 95.8% and precision is 85.6%. Finally, Cluster 3 (Yellow): Less Active Customers with Low Order Volume shows recall of 91% and precision of 91.7%. Overall, the model performs well, with Cluster 2 showing the highest recall and Cluster 1 the strongest precision. The relatively low recall for Cluster 1 (70.8%) suggests that the model does not always correctly identify customers in this group, producing false negatives.

```{r, warning=FALSE}
# Extract variable importance from the multinomial model
variable_importance <- varImp(mlogistic_model, scale = TRUE)

# Define custom colors for the clusters
palette_clusters <- c("1" = "#FF6347",  # Coral
                      "2" = "#4682B4",  # Cornflower blue
                      "3" = "#FFD700")  # Yellow

# Extract the importance data
var_imp_df <- as.data.frame(variable_importance$importance)
var_imp_df$Variable <- rownames(var_imp_df)

# Convert to long format
var_imp_long <- melt(var_imp_df, id.vars = "Variable", variable.name = "Cluster", value.name = "Importance")

# Clean up cluster names (remove 'Overall' if present)
var_imp_long$Cluster <- gsub("Overall", "", var_imp_long$Cluster)

# Keep only the top 10 variables per cluster for better visualization
top_vars <- var_imp_long %>%
  group_by(Cluster) %>%
  top_n(10, Importance) %>%
  ungroup()

# Create the visualization with custom cluster colors and no color legend
ggplot(top_vars, aes(x = reorder(Variable, Importance), y = Importance, fill = Cluster)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  facet_wrap(~ Cluster, scales = "free_x") +
  scale_fill_manual(values = palette_clusters) +
  # Use the same 0-100 importance scale for all facets
  scale_y_continuous(limits = c(0, 100)) +
  labs(title = "Multinomial Logistic Regression",
       subtitle = "Variable Importance by Cluster",
       x = "Variables", y = "Importance") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 12, color = "gray30"),
        strip.text = element_text(size = 12, face = "bold"),
        axis.title = element_text(face = "bold"),
        axis.text.y = element_text(size = 9),
        panel.grid.major.y = element_blank()) +
  guides(fill = "none")  # Remove the color legend
```

The model indicates that:

- For **Cluster 1**, the key variables included the number of orders, the RFM score, order type (MyCoke Legacy), order type (MyCoke 360), order type (sales representatives), and chain membership.
- For **Cluster 2**, the most significant variables were the average number of days between orders, the low-demand customer flag, order type (call center), order type (MyCoke Legacy), and channel growth potential.
- For **Cluster 3**, the most important variables were the average number of days between orders, the low-demand customer flag, the RFM score, and days since the first delivery.

The models created to predict clusters for new customers performed well and provide insights that clearly help in understanding the characteristics influencing the clusters.
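Before moving on, one way such predictions can feed an assignment decision is worth sketching: combine the predicted cluster with the volume benchmark. This is an illustration only — one possible rule under the 400-gallon cutoff from the brief, with hypothetical column names, not the project's final criteria:

```r
library(dplyr)

# Hypothetical scored customers: a predicted cluster plus average annual volume
scored <- data.frame(
  CUSTOMER_ID = 1:4,
  CLUSTER = c(1, 2, 2, 3),
  AVG_ANNUAL_CONSUMP = c(2500, 650, 310, 90)
)

# Example rule: clients above the 400-gallon benchmark, or members of the
# high-demand cluster, are served by the own fleet (Red Trucks)
scored <- scored %>%
  mutate(FLEET_TYPE = if_else(AVG_ANNUAL_CONSUMP >= 400 | CLUSTER == 1,
                              "RED TRUCK", "WHITE TRUCK"))

scored$FLEET_TYPE
```

Encoding the rule as a single `mutate()` makes it easy to swap in refined criteria later without touching the scoring pipeline.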
Therefore, we can proceed with the final analysis for fleet assignment.```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}# List all variables in the environmentall_vars <-ls()# Exclude 'full_data', 'full_data_customer', and the new variables from removalvars_to_keep <-c("full_data", "full_data_customer","mydir", "one_seed", "reference_date")# Get the variables to removevars_to_remove <-setdiff(all_vars, vars_to_keep)# Remove the temporary data framesrm(list = vars_to_remove)# Clean up by removing 'all_vars' and 'vars_to_remove'rm(all_vars, vars_to_remove)```## 9. Data driven fleet assingmentBased on all the previous analyses, it is concluded that the fleet type designated for clients should be defined by considering different criteria, not just the average annual volume demand.The main criteria shaping this approach include the similarities among clients represented by the clusters, the analysis of volume distributions by cold drink channel segment, and the growth potential of the clients.Before proceeding, the relationship between the 400 gallons annual threshold for each cluster will be analyzed.```{r, warning=FALSE}# Define custom colors for the clusterspalette_clusters <-c("1"="#FF6347", # Coral"2"="#4682B4", # Cornflower blue"3"="#FFD700"# Yellow)# Create a data frame for the threshold linethreshold_line <-data.frame(x =c(1, 500),y =c(400, 400),type ="400 Gallons Threshold")# Create scatter plot with log scales, separated by CLUSTER using facet_wrapggplot() +geom_jitter(data = full_data_customer, aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color =as.factor(CLUSTER)),alpha =0.5, width =0.2) +# Jitter to avoid overplottinggeom_line(data = threshold_line, aes(x = x, y = y, linetype = type), color ="red", size =1) +scale_x_log10(limits =c(1, 500)) +# Log scale for x-axis with specific limits (1 -> 500)scale_y_log10(limits =c(10, 1000000), # Set limits for the y-axis from 10 to 1,000,000breaks =c(10, 100, 1000, 10000, 100000, 
               1000000),  # Custom breaks for the y-axis
    labels = scales::comma  # Format numbers with commas
  ) +
  scale_color_manual(values = palette_clusters, name = "Cluster") +
  scale_linetype_manual(values = "solid", name = "") +  # Add the threshold line to the legend
  labs(
    title = "Average Annual Consumption vs. Number of Orders by Cluster",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ CLUSTER, scales = "fixed") +  # Same scale across all facets
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 9),
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "right"
  )
```

Regarding the 400-gallon benchmark for defining clients to be served by red trucks, it is possible to note that:

- Cluster 1: This cluster mainly captures clients with higher demand volumes or a larger number of orders. A smaller portion of these clients fall below the 400-gallon threshold, some close to a minimum of 100 gallons.
- Cluster 2: This cluster has a large number of clients both above and below the threshold, so it requires further refinement.
- Cluster 3: The vast majority of clients fall below the threshold.
However, the few clients above it tend to place only a small number of orders per year.

### 9.1 Cluster 2 Analysis for Fleet Assignment

Cluster 2 comprises just over half of all clients, making it difficult to define clear criteria for fleet designation. The multinomial regression model indicated that "Average Days Between Orders" (AVG_DAYS_BET_ORD) was the most important variable, while in the decision tree model it was the second most important. Therefore, the plot below shows the relationship between each client's average annual consumption and their average days between orders.

```{r, warning=FALSE}
# Define custom colors for the clusters
palette_clusters <- c(
  "1" = "#FF6347",  # Coral
  "2" = "#4682B4",  # Cornflower blue
  "3" = "#FFD700"   # Yellow
)

# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Scatter plot with log scales, separated by CLUSTER using facet_wrap
ggplot() +
  geom_jitter(data = full_data_customer,
              aes(x = AVG_DAYS_BET_ORD, y = AVG_ANNUAL_CONSUMP,
                  color = as.factor(CLUSTER)),
              alpha = 0.5, width = 0.2) +  # Jitter to avoid overplotting
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(
    limits = c(1, 1000),
    breaks = c(10, 100, 1000),
    labels = scales::comma
  ) +
  scale_y_log10(
    limits = c(10, 1000000),
    breaks = c(10, 100, 1000, 10000, 100000, 1000000),
    labels = scales::comma
  ) +
  scale_color_manual(values = palette_clusters, name = "Cluster") +
  scale_linetype_manual(values = "solid", name = "") +
  labs(
    title = "Avg. Annual Consumption vs. Avg. Days Between Orders by Cluster",
    x = "Average Days Between Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ CLUSTER, scales = "fixed") +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 9),
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "right"
  )
```

```{r}
# Filter the data for Cluster 2
cluster_2_data <- full_data_customer %>% filter(CLUSTER == 2)

# Calculate deciles for AVG_DAYS_BET_ORD
deciles <- quantile(cluster_2_data$AVG_DAYS_BET_ORD, probs = seq(0, 1, 0.1))

# Build a decile table and transpose it for horizontal display
decile_table <- data.frame(
  Decile      = paste0(seq(0, 90, 10), "%"),
  Lower_Bound = deciles[-length(deciles)],  # All but the last quantile value
  Upper_Bound = deciles[-1]                 # All but the first quantile value
)
decile_table_t <- t(decile_table[, -1])
colnames(decile_table_t) <- decile_table$Decile
decile_table_t

# Summarize AVG_DAYS_BET_ORD for Cluster 2, rounded to whole days
summary_cluster_2 <- round(summary(cluster_2_data$AVG_DAYS_BET_ORD), 0)
summary_table <- data.frame(
  Statistic = names(summary_cluster_2),
  Value     = as.vector(summary_cluster_2)
)
summary_table
```

When filtering the average days between orders for Cluster 2, it is observed that 60 percent of customers have an average of 33 days or fewer
between orders. The group's average is 56.4 days, with a median of 24 days.

Building upon the previously calculated variables (low-demand customers and high-growth-potential customers), additional business-relevant criteria will be introduced to better segment customers within Cluster 2. These new criteria are an average annual consumption greater than 1,349 gallons and an average of 33 or fewer days between orders. The first threshold was chosen because it represents the point at which delivery costs are minimized. The second was selected due to its significant influence on the clustering model, and because customers with high growth potential (excluding low-demand customers) who average 33 days or fewer between orders, nearly two-thirds of customers, are believed to have the potential to order more frequently, further shortening the order interval.

As a result, in the plot below, customers who are not low demand, show high growth potential, and average 33 or fewer days between orders, or whose average annual consumption exceeds 1,349 gallons, will be classified as Emerging Opportunities and assigned to the red truck category.

```{r, message=FALSE, warning=FALSE}
# Define custom colors
palette_clusters <- c(
  "Emerging Opportunities - RED TRUCK" = "#B33951",
  "General Clients - WHITE TRUCK"      = "#D3D3D3"
)

# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Keep only Cluster 2 and create the category used for faceting
filtered_data <- full_data_customer %>%
  filter(CLUSTER == 2) %>%
  mutate(
    Category = ifelse(
      (LOW_DEMAND_CUST == 0 & HIGH_GROW_POT == 1 & AVG_DAYS_BET_ORD <= 33) |
        AVG_ANNUAL_CONSUMP > 1349,
      "Emerging Opportunities - RED TRUCK",
      "General Clients - WHITE TRUCK"
    )
  )

# Scatter plot with facet_wrap
ggplot(filtered_data) +
  geom_jitter(aes(x = AVG_DAYS_BET_ORD, y = AVG_ANNUAL_CONSUMP, color = Category),
              width = 0.2, alpha = 0.5)
+# Jitter to avoid overplottinggeom_line(data = threshold_line, aes(x = x, y = y, linetype = type), color ="red", size =1) +scale_x_log10(limits =c(1, 1000)) +# Log scale for x-axis with specific limits (1 -> 500)scale_y_log10(limits =c(10, 1000000), # Set limits for the y-axis from 10 to 1,000,000breaks =c(10, 100, 1000, 10000, 100000, 1000000), # Custom breaks for the Y-axislabels = scales::comma # Format numbers with commas ) +scale_color_manual(values = palette_clusters, name ="Fleet Assignment") +scale_linetype_manual(values ="solid", name ="") +# Add the threshold line to legendlabs(title ="Cluster 2 - Avg. Annual Consumption vs. Avg. Days Between Orders",x ="Average Days Between Orders (Log Scale)",y ="Avg Annual Consumption (Log Scale)" ) +facet_wrap(~ Category, scales ="fixed") +# Separate categories side by sidetheme_minimal() +theme(text =element_text(size =12),axis.text.x =element_text(size =10),axis.text.y =element_text(size =10),panel.grid.major.y =element_line(color ="gray90", linetype ="solid", size =0.3), panel.grid.major.x =element_line(color ="gray90", linetype ="solid", size =0.3), panel.grid.minor =element_blank(), # Remove minor grid linespanel.background =element_rect(fill ="white", color ="white"), legend.position ="right", # Move legend to right sidestrip.text =element_blank() # Remove facet titles )# Add CLUSTER_2_FLEET variable to full_data_customerfull_data_customer <- full_data_customer %>%mutate(CLUSTER_2_FLEET =ifelse(CLUSTER ==2,ifelse((LOW_DEMAND_CUST ==0& HIGH_GROW_POT ==1& AVG_DAYS_BET_ORD <=33) | AVG_ANNUAL_CONSUMP >1349, "RED TRUCK", "WHITE TRUCK"),NA) )```Below, the impact of fleet assignment on each cold drink channel will be explored using the previous criteria for Cluster 2, and its relation to average annual consumption and the number of orders will be analyzed.```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}# Define the custom color palette for COLD_DRINK_CHANNEL with unique 
colorscold_drink_channel_colors <-c("DINING"="#A7ADC6", "PUBLIC SECTOR"="#FF6347", "EVENT"="#B33951", "WORKPLACE"="#ABD2FA", "ACCOMMODATION"="#E377C2", "GOODS"="#FFD700", "BULK TRADE"="#8ED081", "WELLNESS"="#20B2AA", "CONVENTIONAL"="#1F77B4")# Filter data for Cluster 2cluster_2_data <-subset(full_data_customer, CLUSTER ==2)# Create a data frame for the threshold linethreshold_line <-data.frame(x =c(1, 500),y =c(400, 400),type ="400 Gallons Threshold")# Create scatter plot with log scales for Cluster 2ggplot() +geom_jitter(data = cluster_2_data, aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color =as.factor(COLD_DRINK_CHANNEL)),alpha =0.5, width =0.2) +# Jitter to avoid overplottinggeom_line(data = threshold_line, aes(x = x, y = y, linetype = type), color ="red", size =1) +scale_x_log10(limits =c(1, 500)) +# Log scale for x-axis with specific limits (1 -> 500)scale_y_log10(limits =c(10, 100000), # Adjusted upper limit for 100,000breaks =c(10, 100, 1000, 10000, 100000), # Adjusted break pointslabels = scales::comma # Format numbers with commas ) +scale_color_manual(values = cold_drink_channel_colors) +# Apply custom colors to COLD_DRINK_CHANNELscale_linetype_manual(values ="solid", name =NULL) +# Remove title from legendlabs(title ="Average Annual Consumption vs. 
Number of Orders for Cluster 2",x ="Number of Orders (Log Scale)",y ="Average Annual Consumption (Log Scale)" ) +facet_wrap(~ COLD_DRINK_CHANNEL, scales ="fixed") +# Facet by COLD_DRINK_CHANNELguides(color ="none") +# Remove legend for COLD_DRINK_CHANNELtheme_minimal() +theme(text =element_text(size =12),axis.text.x =element_text(size =10),axis.text.y =element_text(size =10),strip.text =element_text(size =10), # Adjust facet labels' sizepanel.grid.major.y =element_line(color ="gray90", linetype ="solid", size =0.3), # Major Y grid lines for the specific breakspanel.grid.major.x =element_line(color ="gray90", linetype ="solid", size =0.3), # Light gray vertical grid lines as backgroundpanel.grid.minor =element_blank(), # Remove minor grid linespanel.background =element_rect(fill ="white", color ="white"), # Ensure clean backgroundlegend.position ="bottom"# Move legend below the plot )``````{r, warning=FALSE}# Define color palettepalette_clusters <-c("RED TRUCK"="#B33951", "WHITE TRUCK"="#D3D3D3")# Filter "RED TRUCK" data from full_data_customercluster_2_red_truck_data <-subset(full_data_customer, !is.na(CLUSTER_2_FLEET) & CLUSTER_2_FLEET =="RED TRUCK")# Create a data frame for the threshold line (fixed at 400 gallons)threshold_line <-data.frame(x =c(1, 500),y =c(400, 400),type ="400 Gallons Threshold")# Create the plotggplot() +geom_jitter(data = cluster_2_red_truck_data, aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color =as.factor(CLUSTER_2_FLEET)),alpha =0.5, width =0.2) +geom_line(data = threshold_line, aes(x = x, y = y, linetype = type), color ="red", size =1) +scale_x_log10(limits =c(1, 500)) +scale_y_log10(limits =c(10, 100000), breaks =c(10, 100, 1000, 10000, 100000), labels = comma ) +scale_color_manual(values = palette_clusters, name ="Fleet Assignment") +scale_linetype_manual(values ="solid", name =NULL) +labs(title ="Average Annual Consumption vs. 
Number of Orders for Cluster 2",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ COLD_DRINK_CHANNEL, scales = "fixed") +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 10),
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "right"
  )
```

In an effort to capture growth opportunities, almost every channel would have a considerable number of red-truck clients with a volume below 400 gallons.

```{r, warning=FALSE}
# Define color palette
palette_clusters <- c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")

# Filter the Cluster 2 "WHITE TRUCK" assignments from full_data_customer
cluster_2_white_truck_data <- subset(full_data_customer,
                                     !is.na(CLUSTER_2_FLEET) & CLUSTER_2_FLEET == "WHITE TRUCK")

# Create a data frame for the threshold line (fixed at 400 gallons)
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Create the plot
ggplot() +
  geom_jitter(data = cluster_2_white_truck_data,
              aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP,
                  color = as.factor(CLUSTER_2_FLEET)),
              alpha = 0.5, width = 0.2) +
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +
  scale_y_log10(limits = c(10, 100000),
                breaks = c(10, 100, 1000, 10000, 100000),
                labels = comma) +
  scale_color_manual(values = palette_clusters, name = "Fleet Assignment") +
  scale_linetype_manual(values = "solid", name = NULL) +
  labs(title = "Average Annual Consumption vs.
Number of Orders for Cluster 2",x ="Number of Orders (Log Scale)",y ="Average Annual Consumption (Log Scale)" ) +facet_wrap(~ COLD_DRINK_CHANNEL, scales ="fixed") +theme_minimal() +theme(text =element_text(size =12),axis.text.x =element_text(size =10),axis.text.y =element_text(size =10),strip.text =element_text(size =10), panel.grid.major.y =element_line(color ="gray90", linetype ="solid", size =0.3), panel.grid.major.x =element_line(color ="gray90", linetype ="solid", size =0.3), panel.grid.minor =element_blank(), panel.background =element_rect(fill ="white", color ="white"), legend.position ="right" )```On the other hand, the criteria naturally assign white trucks to a large number of clients with an average annual volume of less than 400 gallons in each segment, while still ensuring that high-volume clients are served by red trucks.Also, the previous graphs represent an opportunity for the company to develop targeted strategies for each segment.```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}# Filter data for customers with LOW_DEMAND_CUST == 0 and HIGH_GROW_POT == 1filtered_data_high_demand <-subset(full_data_customer, LOW_DEMAND_CUST ==0& HIGH_GROW_POT ==1)# Add a column showing the consumption categoryfiltered_data_high_demand$Threshold <-ifelse(filtered_data_high_demand$AVG_ANNUAL_CONSUMP >=400, ">= 400", "< 400")# Count the number of customers and calculate UNIT_COST for each COLD_DRINK_CHANNEL and Thresholdcustomer_and_cost_summary <- filtered_data_high_demand %>%group_by(COLD_DRINK_CHANNEL, Threshold) %>%summarise(Customer_Count =n(),Total_Cost =sum(TOTAL_COST_CA_GAL, na.rm =TRUE),Total_QTD_DLV =sum(QTD_DLV_TOTAL, na.rm =TRUE),UNIT_COST =round(Total_Cost / Total_QTD_DLV, 2) ) %>%arrange(COLD_DRINK_CHANNEL, Threshold) %>%mutate(Customer_Count =comma(Customer_Count), Total_Cost =comma(Total_Cost), Total_QTD_DLV =comma(Total_QTD_DLV) )# Display formatted datacustomer_and_cost_summary %>%kable("html", escape =FALSE, align 
="c", col.names =c("Channel", "Threshold", "Customer Count", "Total Cost", "Total Qtd DLV", "Unit Cost")) %>%kable_styling(full_width =FALSE, position ="center") %>%column_spec(1, bold =TRUE) %>%column_spec(2:6, width ="6em") %>%row_spec(0, bold =TRUE, color ="black", background ="#4682B4") %>%add_header_above(c("Cluster 2 Summary"=6)) %>%kable_paper("striped", full_width =FALSE)``````{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}# Compute summary statistics for all customers in Cluster 2total_summary <- full_data_customer %>%filter(CLUSTER ==2) %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(Customer_Count =n(),Total_Cost =sum(TOTAL_COST_CA_GAL, na.rm =TRUE),Total_QTD_DLV =sum(QTD_DLV_TOTAL, na.rm =TRUE),UNIT_COST =round(Total_Cost / Total_QTD_DLV, 2) ) %>%mutate(Filter ="All Cluster 2 Cust" )# Compute summary statistics for filtered customers in Cluster 2filtered_summary <- full_data_customer %>%filter(CLUSTER ==2, LOW_DEMAND_CUST ==0, HIGH_GROW_POT ==1, AVG_ANNUAL_CONSUMP <1349) %>%group_by(COLD_DRINK_CHANNEL) %>%summarise(Customer_Count =n(),Total_Cost =sum(TOTAL_COST_CA_GAL, na.rm =TRUE),Total_QTD_DLV =sum(QTD_DLV_TOTAL, na.rm =TRUE),UNIT_COST =round(Total_Cost / Total_QTD_DLV, 2) ) %>%mutate(Filter ="High-Performance Customers" )# Combine both summariescombined_data <-bind_rows(total_summary, filtered_summary) %>%arrange(COLD_DRINK_CHANNEL, desc(Filter))# Correctly calculate percentagescustomer_and_cost_summary <- combined_data %>%group_by(COLD_DRINK_CHANNEL) %>%mutate(Percentage_Customer_Count =ifelse(Filter =="High-Performance Customers", Customer_Count /first(Customer_Count[Filter =="All Cluster 2 Cust"]) *100, 100),Percentage_QTD_DLV =ifelse(Filter =="High-Performance Customers", Total_QTD_DLV /first(Total_QTD_DLV[Filter =="All Cluster 2 Cust"]) *100, 100) ) %>%ungroup()# Display the data correctly formattedcustomer_and_cost_summary %>%mutate(Customer_Count =comma(Customer_Count), Total_Cost =comma(Total_Cost), Total_QTD_DLV 
    = comma(Total_QTD_DLV),
    Percentage_Customer_Count = round(Percentage_Customer_Count, 1),  # Round percentages
    Percentage_QTD_DLV = round(Percentage_QTD_DLV, 1)
  ) %>%
  dplyr::select(COLD_DRINK_CHANNEL, Customer_Count, Percentage_Customer_Count,
                Total_Cost, Total_QTD_DLV, Percentage_QTD_DLV, UNIT_COST, Filter) %>%
  kable("html", escape = FALSE, align = "c",
        col.names = c("Channel", "Customer Count", "Customer %", "Total Cost",
                      "Total Qtd DLV", "Volume %", "Unit Cost", "Filter")) %>%
  kable_styling(full_width = FALSE, position = "center") %>%
  column_spec(1, bold = TRUE) %>%
  column_spec(2:8, width = "6em") %>%
  row_spec(0, bold = TRUE, color = "black", background = "#4682B4") %>%
  add_header_above(c("Cluster 2 Summary" = 8)) %>%
  kable_paper("striped", full_width = FALSE)
```

### 9.2 Fleet Assignment Criteria

Based on the analysis, the recommended fleet assignment is determined by applying the following rules in order:

1. Customers with an average annual consumption greater than **1,349 gallons** will be assigned to **RED TRUCKS**.
2. **Low-demand customers** (`LOW_DEMAND_CUST == 1`) will be assigned to **WHITE TRUCKS**.
3. All remaining customers in **Cluster 1** will be assigned to **RED TRUCKS**.
4. All remaining customers in **Cluster 3** will be assigned to **WHITE TRUCKS**.
5. Remaining customers in **Cluster 2** will be assigned to **RED TRUCKS** if they meet **both** of the following conditions:
   - They are classified as **high growth potential** (`HIGH_GROW_POT == 1`).
   - Their **average days between orders** is less than or equal to 33 (`AVG_DAYS_BET_ORD <= 33`).
6. The remaining customers in **Cluster 2** will be assigned to **WHITE TRUCKS**.
7.
Any customers who do not meet any of these criteria will remain unclassified (`NA`).```{r}# Assign customers to RED TRUCK or WHITE TRUCK based on specified criteriafull_data_customer <- full_data_customer %>%mutate(NEW_FLEET =case_when( AVG_ANNUAL_CONSUMP >1349~"RED TRUCK", # Customers with high annual consumption LOW_DEMAND_CUST ==1~"WHITE TRUCK", # Low-demand customers CLUSTER ==1~"RED TRUCK", # Cluster 1 customers CLUSTER ==3~"WHITE TRUCK", # Cluster 3 customers CLUSTER ==2& (LOW_DEMAND_CUST ==0& HIGH_GROW_POT ==1& AVG_DAYS_BET_ORD <=33) ~"RED TRUCK", # Cluster 2 customers with all conditions met CLUSTER ==2~"WHITE TRUCK", # Remaining Cluster 2 customersTRUE~NA_character_# Others remain NA ) )```Below are the representations of the clusters and the designated fleet.```{r, warning=FALSE}# Define custom colors for the fleet and clusterspalette_fleet <-c("RED TRUCK"="#B33951", # Red truck"WHITE TRUCK"="#D3D3D3"# White truck)palette_clusters <-c("Cluster 1"="#FF6347", # Coral"Cluster 2"="#4682B4", # Cornflower blue"Cluster 3"="#FFD700"# Yellow)# Create a data frame for the threshold linethreshold_line <-data.frame(x =c(1, 500),y =c(400, 400),type ="400 Gallons Threshold")# Filter the data for RED TRUCK and WHITE TRUCKred_truck_data <- full_data_customer %>%filter(NEW_FLEET =="RED TRUCK")white_truck_data <- full_data_customer %>%filter(NEW_FLEET =="WHITE TRUCK")# Combine both datasets to differentiate them in facet_wrapcombined_data <-bind_rows( red_truck_data %>%mutate(Fleet_Type ="RED TRUCK"), white_truck_data %>%mutate(Fleet_Type ="WHITE TRUCK"))# Define a custom labeller for the clusterscustom_labeller <-labeller(CLUSTER =c("1"="Cluster 1","2"="Cluster 2","3"="Cluster 3" ))# Create scatter plot with log scales and no background color for facet labelsggplot(combined_data) +geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = Fleet_Type),alpha =0.5, width =0.2) +# Jitter to avoid overplottinggeom_line(data = threshold_line, aes(x = x, y = y, linetype = 
type), color ="red", size =1) +scale_x_log10(limits =c(1, 500)) +# Log scale for x-axis with specific limitsscale_y_log10(limits =c(10, 1000000),breaks =c(10, 100, 1000, 10000, 100000, 1000000),labels = scales::comma ) +scale_color_manual(values = palette_fleet) +scale_linetype_manual(values ="solid", name ="") +labs(title ="Fleet Assignment by Cluster",x ="Number of Orders (Log Scale)",y ="Average Annual Consumption (Log Scale)" ) +facet_wrap(~ CLUSTER + Fleet_Type, scales ="fixed", nrow =1, labeller = custom_labeller) +theme_minimal() +theme(text =element_text(size =12),axis.text.x =element_text(size =10),axis.text.y =element_text(size =10),strip.text =element_text(size =9), # Adjust text size for facet labelsstrip.background =element_blank(), # Remove background color from facet labelspanel.grid.major.y =element_line(color ="gray90", linetype ="solid", size =0.3),panel.grid.major.x =element_line(color ="gray90", linetype ="solid", size =0.3),panel.grid.minor =element_blank(),panel.background =element_rect(fill ="white", color ="white"),legend.position ="bottom",legend.box ="vertical" ) +guides(color ="none") # Remove legend for Fleet_Type# Summmary#summary(as.factor(full_data_customer$NEW_FLEET))#summary(as.factor(full_data_customer$FLEET_TYPE))# RED TRUCK WHITE TRUCK # 7239 23081 ``````{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}# Create a table for FLEET_TYPE counts by Clusterfleet_type_table <- full_data_customer %>%group_by(CLUSTER, FLEET_TYPE) %>%summarise(count =n(), .groups ="drop") %>%gt() %>%tab_header(title ="400 gallons threshold - customer count" ) %>%cols_label(CLUSTER ="Cluster",FLEET_TYPE ="Fleet Type",count ="Customer Count" )# Create a table for NEW_FLEET counts by Clusternew_fleet_table <- full_data_customer %>%group_by(CLUSTER, NEW_FLEET) %>%summarise(count =n(), .groups ="drop") %>%gt() %>%tab_header(title ="New Fleet - customer count" ) %>%cols_label(CLUSTER ="Cluster",NEW_FLEET ="New Fleet",count ="Customer 
Count" )# View the tablesfleet_type_tablenew_fleet_table```The new criteria established labels for all customers. A total of 7,926 customers were assigned to "Red Truck", while 22,394 customers were assigned to "White Truck".The annual average consumption criterion of 400 gallons would have assigned 7,239 customers to be served by "Red Truck" and 23,081 customers to be served by "White Trucks".Therefore, 687 clients who were previously served by white trucks and who present higher growth potential will now be served by red trucks.```{r, warning=FALSE, message=FALSE}# Create a combined summary for both fleet types with percentages calculated separately by fleet_designationsummary_fleet_comparison_percent <- full_data_customer %>%# Create a longer dataset with both fleet designationspivot_longer(cols =c(NEW_FLEET, FLEET_TYPE),names_to ="fleet_designation",values_to ="fleet_value" ) %>%# Convert to factorsmutate(LOCAL_FOUNT_ONLY =as.factor(LOCAL_FOUNT_ONLY),fleet_designation =factor(fleet_designation, levels =c("FLEET_TYPE", "NEW_FLEET"),labels =c("Over 400 gallons threshold", "Recommended Fleet Type")) ) %>%# Group by fleet_designation and fleet_value, then calculate countgroup_by(fleet_designation, fleet_value) %>%summarise(count =n(), .groups ='drop') %>%# Calculate the percentage for each fleet_designationgroup_by(fleet_designation) %>%mutate(percentage = (count /sum(count)) *100) %>%ungroup()# Plot with facet_wrap and custom background showing percentage valuesggplot(summary_fleet_comparison_percent, aes(x = fleet_value, y = percentage, fill = fleet_value)) +# Add background based on facetgeom_rect(data =data.frame(fleet_designation ="Over 400 gallons threshold"),aes(xmin =-Inf, xmax =Inf, ymin =-Inf, ymax =Inf),fill ="lightyellow", alpha =0.3, inherit.aes =FALSE) +# Light gray background for "Over 400 gallons threshold"geom_bar(stat ="identity", position ="dodge", alpha =0.8) +geom_text(aes(label = scales::comma(percentage, accuracy =0.1, suffix ="%")), position 
=position_dodge(width =0.8), vjust =-0.5, size =3.5) +facet_wrap(~ fleet_designation, scales ="fixed") +# Fixed scale for both facetslabs(title ="Comparison of Customer Distribution by Fleet Type Designation",x ="Fleet Type") +scale_fill_manual(values =c("RED TRUCK"="#B33951", "WHITE TRUCK"="#D3D3D3")) +theme_minimal() +theme(plot.title =element_text(size =10, face ="bold"),axis.text.y =element_text(size =9),axis.title.y =element_text(size =10),axis.text.x =element_text(size =10),axis.title.x =element_blank(),legend.position ="none",panel.grid.major.x =element_blank(),strip.text =element_text(size =11, face ="bold"),strip.background =element_rect(fill ="white", color =NA),panel.spacing =unit(1, "lines") )```According to the criteria, 26% of customers would be served by red trucks and 74% by white trucks.```{r, warning=FALSE, message=FALSE}# Creating a combined summary for both fleet typessummary_fleet_comparison_absolute <- full_data_customer %>%# Create a longer dataset with both fleet designationspivot_longer(cols =c(NEW_FLEET, FLEET_TYPE),names_to ="fleet_designation",values_to ="fleet_value" ) %>%# Convert to factorsmutate(LOCAL_FOUNT_ONLY =as.factor(LOCAL_FOUNT_ONLY),fleet_designation =factor(fleet_designation, levels =c("FLEET_TYPE", "NEW_FLEET"),labels =c("Over 400 gallons threshold", "Recommended Fleet Type")) ) %>%# Group and calculate total counts (absolute numbers)group_by(fleet_designation, LOCAL_FOUNT_ONLY, fleet_value) %>%summarise(count =n()) %>%ungroup()# Plot with facet_wrap and custom background showing total customer countsggplot(summary_fleet_comparison_absolute, aes(x = LOCAL_FOUNT_ONLY, y = count, fill = fleet_value)) +# Add background based on facetgeom_rect(data =data.frame(fleet_designation ="Over 400 gallons threshold"),aes(xmin =-Inf, xmax =Inf, ymin =-Inf, ymax =Inf),fill ="lightyellow", alpha =0.3, inherit.aes =FALSE) +# Light gray background for "Over 400 gallons threshold"geom_bar(stat ="identity", position ="dodge", alpha =0.8) 
+geom_text(aes(label = scales::comma(count)), position =position_dodge(width =0.8), vjust =-0.5, size =3.5) +facet_wrap(~ fleet_designation, scales ="fixed") +# Fixed scale for both facetslabs(title ="Comparison of Number of Customers by Fleet Type Designation",x ="Customer Type") +scale_fill_manual(values =c("RED TRUCK"="#B33951", "WHITE TRUCK"="#D3D3D3")) +scale_x_discrete(labels =c("0"="Others", "1"="Local Fountain Only")) +theme_minimal() +theme(plot.title =element_text(size =10, face ="bold"),axis.text.y =element_text(size =9),axis.title.y =element_text(size =10),axis.text.x =element_text(size =10, color ="black"),axis.title.x =element_blank(),legend.position ="bottom",legend.title =element_text(face ="bold"),panel.grid.major.x =element_blank(),strip.text =element_text(size =11, face ="bold"),strip.background =element_rect(fill ="white", color =NA),panel.spacing =unit(1, "lines") )```Considering only the "Others" group (customers who order from multiple sources), our recommendation would result in 729 additional stores being served by red trucks, compared to the 400-gallon threshold—an increase of 10.3%. In contrast, within the 'Local Fountain Only' group, the number of customers served by red trucks would decrease by 42, representing a 23.2% reduction. 
```{r, warning=FALSE, message=FALSE}
# Creating the summary data with both fleet designations
summary_volume_comparison <- full_data_customer %>%
  # Calculate total volume per customer
  mutate(total_volume = QTD_DLV_CA_2023 + QTD_DLV_GAL_2023 + QTD_DLV_CA_2024 + QTD_DLV_GAL_2024) %>%
  # Create a longer format dataset with both fleet designations
  pivot_longer(cols = c(NEW_FLEET, FLEET_TYPE),
               names_to = "fleet_designation",
               values_to = "fleet_value") %>%
  # Group by fleet designation, value, and LOCAL_FOUNT_ONLY
  group_by(fleet_designation, fleet_value, LOCAL_FOUNT_ONLY) %>%
  # Sum volumes within each group
  summarise(group_volume = sum(total_volume, na.rm = TRUE)) %>%
  ungroup() %>%
  # Calculate total volume for percentage
  group_by(fleet_designation) %>%
  mutate(total_designation_volume = sum(group_volume),
         percentage = group_volume / total_designation_volume * 100) %>%
  ungroup() %>%
  # Convert to factors for proper ordering
  mutate(LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
         fleet_designation = factor(fleet_designation,
                                    levels = c("FLEET_TYPE", "NEW_FLEET"),
                                    labels = c("Over 400 gallons threshold", "Recommended Fleet Type")))

# Create the faceted chart
ggplot(summary_volume_comparison, aes(x = LOCAL_FOUNT_ONLY, y = percentage, fill = fleet_value)) +
  # Light yellow background for the "Over 400 gallons threshold" facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::comma(percentage, accuracy = 0.1, suffix = "%")),
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  facet_wrap(~ fleet_designation, scales = "fixed") + # Fixed scale for both facets
  labs(title = "Comparison of Volume Distribution by Fleet Type Designation",
       y = "Percentage of Total Volume") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3"),
                    name = "Fleet Type") +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 9),
        axis.title.y = element_text(size = 10),
        axis.text.x = element_text(size = 10, angle = 0),
        axis.title.x = element_blank(),
        legend.position = "bottom",
        panel.grid.major.x = element_blank(),
        strip.text = element_text(size = 11, face = "bold"))
```

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# Creating the summary data with both fleet designations
summary_volume_comparison <- full_data_customer %>%
  # Calculate total volume per customer
  mutate(total_volume = QTD_DLV_CA_2023 + QTD_DLV_GAL_2023 + QTD_DLV_CA_2024 + QTD_DLV_GAL_2024) %>%
  # Create a longer format dataset with both fleet designations
  pivot_longer(cols = c(NEW_FLEET, FLEET_TYPE),
               names_to = "fleet_designation",
               values_to = "fleet_value") %>%
  # Group by fleet designation, value, and LOCAL_FOUNT_ONLY
  group_by(fleet_designation, fleet_value, LOCAL_FOUNT_ONLY) %>%
  # Sum volumes within each group
  summarise(group_volume = sum(total_volume, na.rm = TRUE)) %>%
  ungroup() %>%
  # Convert to factors for proper ordering
  mutate(LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
         fleet_designation = factor(fleet_designation,
                                    levels = c("FLEET_TYPE", "NEW_FLEET"),
                                    labels = c("Over 400 gallons threshold", "Recommended Fleet Type")))

# Create a wide table for comparison, calculating the differences
volume_comparison_table <- summary_volume_comparison %>%
  # Pivot the data wider to separate the two fleet designations
  pivot_wider(names_from = fleet_designation,
              values_from = group_volume) %>%
  # Calculate the absolute and percentage differences
  mutate(absolute_diff = `Recommended Fleet Type` - `Over 400 gallons threshold`,
         percentage_diff = (absolute_diff / `Over 400 gallons threshold`) * 100) %>%
  # Add a column to calculate the total volume per `LOCAL_FOUNT_ONLY`
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(total_local_volume = sum(`Over 400 gallons threshold` + `Recommended Fleet Type`)) %>%
  ungroup()

# Print the resulting table
print(volume_comparison_table)

# Adjusting delivery cost for WHITE TRUCK (ARTM assumption)
full_data_customer <- full_data_customer %>%
  mutate(ARTM_TOTAL_COST = TOTAL_COST_CA_GAL / 5)

# Summarizing delivery cost by both fleet designations
summary_delivery_cost_comparison <- full_data_customer %>%
  mutate(delivery_cost = case_when(
    FLEET_TYPE == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    FLEET_TYPE == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ NA_real_
  )) %>%
  pivot_longer(cols = c(NEW_FLEET, FLEET_TYPE),
               names_to = "fleet_designation",
               values_to = "fleet_value") %>%
  mutate(delivery_cost = case_when(
    fleet_designation == "NEW_FLEET" & fleet_value == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation == "NEW_FLEET" & fleet_value == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    fleet_designation == "FLEET_TYPE" & fleet_value == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation == "FLEET_TYPE" & fleet_value == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ delivery_cost
  )) %>%
  group_by(fleet_designation, fleet_value, LOCAL_FOUNT_ONLY) %>%
  summarise(total_delivery_cost = sum(delivery_cost, na.rm = TRUE)) %>%
  ungroup() %>%
  group_by(fleet_designation) %>%
  mutate(total_designation_cost = sum(total_delivery_cost, na.rm = TRUE),
         percentage = total_delivery_cost / total_designation_cost * 100) %>%
  ungroup() %>%
  mutate(LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
         fleet_designation = factor(fleet_designation,
                                    levels = c("FLEET_TYPE", "NEW_FLEET"),
                                    labels = c("Over 400 Gallons Threshold", "Recommended Fleet Type")))
```

Although the number of customers served by red trucks has increased, the overall volume transported remains relatively stable.

Within the "Others" customer group, there would be a reduction of approximately 1,038,637 gallons over two years, representing a 3.4% decrease.
This volume would now be delivered by white trucks. For the "Local Fountain Only" group, the reduction in volume transported by red trucks is around 104,895 gallons over two years, a 31% decrease.

Despite the increase in the number of customers served by red trucks, which may lead to higher travel times and costs, the recommendation optimizes the delivery system by allowing red trucks to focus on strategic customers while reducing overall costs through higher-volume deliveries using white trucks.

A geographic distribution analysis of the customer base can be carried out at a later stage. One opportunity that emerges from this recommendation is to encourage customers within the same ZIP code to coordinate delivery dates. This would help consolidate volumes, streamline the delivery process, and further reduce operational costs.

Below is the average number of days between orders for each group.

```{r, warning=FALSE, message=FALSE}
# Creating a combined summary for both fleet types with mean AVG_DAYS_BET_ORD
summary_fleet_comparison_absolute <- full_data_customer %>%
  # Create a longer dataset with both fleet designations
  pivot_longer(cols = c(NEW_FLEET, FLEET_TYPE),
               names_to = "fleet_designation",
               values_to = "fleet_value") %>%
  # Convert to factors
  mutate(LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
         fleet_designation = factor(fleet_designation,
                                    levels = c("FLEET_TYPE", "NEW_FLEET"),
                                    labels = c("Over 400 gallons threshold", "Recommended Fleet Type"))) %>%
  # Group by relevant factors and calculate the mean of AVG_DAYS_BET_ORD (average days between orders)
  group_by(fleet_designation, LOCAL_FOUNT_ONLY, fleet_value) %>%
  summarise(mean_days_bet_ord = mean(AVG_DAYS_BET_ORD, na.rm = TRUE)) %>%
  ungroup()

# Plot with facet_wrap and custom background showing mean AVG_DAYS_BET_ORD
ggplot(summary_fleet_comparison_absolute,
       aes(x = LOCAL_FOUNT_ONLY, y = mean_days_bet_ord, fill = fleet_value)) +
  # Light yellow background for the "Over 400 gallons threshold" facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::number(mean_days_bet_ord, accuracy = 0.1)),
            position = position_dodge(width = 0.8), vjust = -0.5, size = 3.5) +
  facet_wrap(~ fleet_designation, scales = "fixed") + # Fixed scale for both facets
  labs(title = "Comparison of Mean Days Between Orders by Fleet Type Designation",
       x = "Customer Type") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3"),
                    name = "Fleet Type") +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 9),
        axis.title.y = element_text(size = 10),
        axis.text.x = element_text(size = 10, color = "black"),
        axis.title.x = element_blank(),
        legend.position = "bottom",
        legend.title = element_text(face = "bold"),
        panel.grid.major.x = element_blank(),
        strip.text = element_text(size = 11, face = "bold"),
        strip.background = element_rect(fill = "white", color = NA),
        panel.spacing = unit(1, "lines"))
```

The red trucks should be optimized to serve the "Others" group, which has an average order interval of 14 days, compared to 23 days under the 400-gallon threshold model. For the "Local Fountain Only" group, the difference relative to the white trucks would be approximately 2 days.

The white trucks, on the other hand, would serve more sporadic customers, with an average interval of over 260 days between orders.

## 10. Recommendation Impacts

### 10.1 Impact on Costs

The cost impact of using red trucks is significantly higher than that of white trucks. For OPEX, the delivery cost for red trucks is approximately 700% higher than for white trucks when considering only variable costs. The calculated cost for the total volume delivered to each customer via red trucks is represented in the column `TOTAL_COST_CA_GAL`.
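To relate a percentage cost difference to a simple divisor (as done for the conservative estimate below), note that a divisor of 5 corresponds to a 400% premium, while the ~700% variable-cost gap would correspond to a divisor of 8. A minimal arithmetic sketch, using illustrative unit costs only (not values from the data):

```r
red_cost <- 8.00            # hypothetical red truck cost per unit
white_cost <- red_cost / 5  # divisor of 5

premium_pct <- (red_cost - white_cost) / white_cost * 100
premium_pct  # 400: red trucks cost 400% more than white trucks

# A 700% gap would instead imply dividing by 8:
(red_cost - red_cost / 8) / (red_cost / 8) * 100  # 700
```

Using the smaller divisor of 5 therefore understates the white truck advantage, which keeps the savings estimates conservative.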
To provide conservative estimates, a 400% difference is assumed: the red truck cost is divided by 5 to estimate the cost for white trucks, represented by `ARTM_TOTAL_COST`. Below is the cost comparison.

```{r, warning=FALSE, message=FALSE}
# Estimate the white truck (ARTM) cost as one fifth of the red truck cost
full_data_customer <- full_data_customer %>%
  mutate(ARTM_TOTAL_COST = TOTAL_COST_CA_GAL / 5)

# Creating the summary data with both fleet designations for delivery cost analysis
summary_delivery_cost_comparison <- full_data_customer %>%
  # Create a column that has the appropriate cost based on fleet type
  mutate(delivery_cost = case_when(
    FLEET_TYPE == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    FLEET_TYPE == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ NA_real_
  )) %>%
  # Reshape the data into a longer format with both fleet designations
  pivot_longer(cols = c(NEW_FLEET, FLEET_TYPE),
               names_to = "fleet_designation",
               values_to = "fleet_value") %>%
  # Update the delivery cost calculation for NEW_FLEET designation
  mutate(delivery_cost = case_when(
    fleet_designation == "NEW_FLEET" & fleet_value == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation == "NEW_FLEET" & fleet_value == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    fleet_designation == "FLEET_TYPE" & fleet_value == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation == "FLEET_TYPE" & fleet_value == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ delivery_cost
  )) %>%
  # Group by fleet designation, value, and LOCAL_FOUNT_ONLY
  group_by(fleet_designation, fleet_value, LOCAL_FOUNT_ONLY) %>%
  # Sum delivery costs in each group
  summarise(total_delivery_cost = sum(delivery_cost, na.rm = TRUE)) %>%
  ungroup() %>%
  # Convert to factors for proper ordering
  mutate(LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
         fleet_designation = factor(fleet_designation,
                                    levels = c("FLEET_TYPE", "NEW_FLEET"),
                                    labels = c("Over 400 gallons threshold", "Recommended Fleet Type")))

# Create the faceted chart
ggplot(summary_delivery_cost_comparison,
       aes(x = LOCAL_FOUNT_ONLY, y = total_delivery_cost, fill = fleet_value)) +
  # Light yellow background for the "Over 400 gallons threshold" facet
  geom_rect(data = data.frame(fleet_designation = "Over 400 gallons threshold"),
            aes(xmin = -Inf, xmax = Inf, ymin = -Inf, ymax = Inf),
            fill = "lightyellow", alpha = 0.3, inherit.aes = FALSE) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = scales::dollar(total_delivery_cost, accuracy = 1)),
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  facet_wrap(~ fleet_designation, scales = "free_y") + # Allow y-axis to vary between facets if necessary
  labs(title = "Comparison of Delivery Cost by Fleet Type Designation",
       y = "Total Delivery Cost ($ Millions)") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3"),
                    name = "Fleet Type") +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  # Scale y-axis to display values in millions
  scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 9),
        axis.title.y = element_text(size = 10),
        axis.text.x = element_text(size = 10, angle = 0),
        axis.title.x = element_blank(),
        legend.position = "bottom",
        panel.grid.major.x = element_blank(),
        strip.text = element_text(size = 11, face = "bold"))
```

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# Creating the summary data with both fleet designations for delivery cost analysis
summary_delivery_cost_comparison <- full_data_customer %>%
  mutate(delivery_cost = case_when(
    FLEET_TYPE == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    FLEET_TYPE == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ NA_real_
  )) %>%
  pivot_longer(cols = c(NEW_FLEET, FLEET_TYPE),
               names_to = "fleet_designation",
               values_to = "fleet_value") %>%
  mutate(delivery_cost = case_when(
    fleet_designation == "NEW_FLEET" & fleet_value == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation == "NEW_FLEET" & fleet_value == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    fleet_designation == "FLEET_TYPE" & fleet_value == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation == "FLEET_TYPE" & fleet_value == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ delivery_cost
  )) %>%
  group_by(fleet_designation, fleet_value, LOCAL_FOUNT_ONLY) %>%
  summarise(total_delivery_cost = sum(delivery_cost, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
         fleet_designation = factor(fleet_designation,
                                    levels = c("FLEET_TYPE", "NEW_FLEET"),
                                    labels = c("Over 400 gallons threshold", "Recommended Fleet Type")))

# Create a wide table for comparison, calculating the differences
delivery_cost_comparison_table <- summary_delivery_cost_comparison %>%
  # Pivot the data wider to separate the two fleet designations
  pivot_wider(names_from = fleet_designation,
              values_from = total_delivery_cost) %>%
  # Calculate the absolute and percentage differences
  mutate(absolute_diff = `Recommended Fleet Type` - `Over 400 gallons threshold`,
         percentage_diff = (absolute_diff / `Over 400 gallons threshold`) * 100) %>%
  # Add a column to calculate the total delivery cost per `LOCAL_FOUNT_ONLY`
  group_by(LOCAL_FOUNT_ONLY) %>%
  mutate(total_local_cost = sum(`Over 400 gallons threshold` + `Recommended Fleet Type`)) %>%
  ungroup()

# Print the resulting table
print(delivery_cost_comparison_table)
```

```{r, message=FALSE, warning=FALSE}
# Summary of delivery costs by fleet designation
delivery_cost_summary <- full_data_customer %>%
  mutate(delivery_cost = case_when(
    FLEET_TYPE == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    FLEET_TYPE == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ NA_real_
  )) %>%
  pivot_longer(cols = c(NEW_FLEET, FLEET_TYPE),
               names_to = "fleet_designation_type",
               values_to = "fleet_label") %>%
  mutate(delivery_cost = case_when(
    fleet_designation_type == "NEW_FLEET" & fleet_label == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation_type == "NEW_FLEET" & fleet_label == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    fleet_designation_type == "FLEET_TYPE" & fleet_label == "WHITE TRUCK" ~ ARTM_TOTAL_COST,
    fleet_designation_type == "FLEET_TYPE" & fleet_label == "RED TRUCK" ~ TOTAL_COST_CA_GAL,
    TRUE ~ delivery_cost
  )) %>%
  group_by(fleet_designation_type, fleet_label, LOCAL_FOUNT_ONLY) %>%
  summarise(total_delivery_cost = sum(delivery_cost, na.rm = TRUE)) %>%
  ungroup() %>%
  mutate(LOCAL_FOUNT_ONLY = as.factor(LOCAL_FOUNT_ONLY),
         fleet_designation_type = factor(fleet_designation_type,
                                         levels = c("FLEET_TYPE", "NEW_FLEET"),
                                         labels = c("Over 400 Gallons Threshold", "Recommended Fleet Type")))

# Comparison table with savings
fleet_savings_summary <- delivery_cost_summary %>%
  pivot_wider(id_cols = c(LOCAL_FOUNT_ONLY, fleet_label),
              names_from = fleet_designation_type,
              values_from = total_delivery_cost) %>%
  mutate(savings = `Over 400 Gallons Threshold` - `Recommended Fleet Type`,
         savings_percentage = (savings / `Over 400 Gallons Threshold`) * 100)

# Cost savings bar chart
savings_plot <- ggplot(fleet_savings_summary,
                       aes(x = LOCAL_FOUNT_ONLY, y = savings, fill = fleet_label)) +
  geom_bar(stat = "identity", position = "dodge", alpha = 0.8) +
  geom_text(aes(label = paste0(scales::dollar(savings, accuracy = 1),
                               "\n(", scales::number(savings_percentage, accuracy = 0.1), "%)")),
            position = position_dodge(width = 0.9), vjust = -0.5, size = 3) +
  labs(title = "Cost Savings for Recommended Fleet Type x 400 gallons threshold",
       y = "Cost Savings ($)") +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3"),
                    name = "Fleet Type") +
  scale_x_discrete(labels = c("0" = "Others", "1" = "Local Fountain Only")) +
  coord_cartesian(ylim = c(-500000, 1500000)) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 9),
        axis.title.y = element_text(size = 10),
        axis.text.x = element_text(size = 10, angle = 0),
        axis.title.x = element_blank(),
        legend.position = "bottom",
        panel.grid.major.x = element_blank())

savings_plot
```

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# Savings table for display
savings_summary_table <- fleet_savings_summary %>%
  mutate(LOCAL_FOUNT_ONLY = ifelse(LOCAL_FOUNT_ONLY == "0", "Others", "Local Fountain Only"),
         `Current Cost` = `Over 400 Gallons Threshold`,
         `Recommended Cost` = `Recommended Fleet Type`,
         `Savings ($)` = savings,
         `Savings (%)` = savings_percentage) %>%
  dplyr::select(LOCAL_FOUNT_ONLY, fleet_label, `Current Cost`, `Recommended Cost`,
                `Savings ($)`, `Savings (%)`)

# Print savings table
print(savings_summary_table)
```

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# List all variables in the environment
all_vars <- ls()

# Exclude 'full_data', 'full_data_customer', and the new variables from removal
vars_to_keep <- c("full_data", "full_data_customer",
                  "mydir", "one_seed", "reference_date")

# Get the variables to remove
vars_to_remove <- setdiff(all_vars, vars_to_keep)

# Remove the temporary data frames
rm(list = vars_to_remove)

# Clean up by removing 'all_vars' and 'vars_to_remove'
rm(all_vars, vars_to_remove)
```

Relative to the 400-gallon threshold, the estimated differences over a two-year period would be:

- **Others – Red Truck**: cost reduction of **$803,612** (2%);
- **Others – White Truck**: cost increase of **$160,722** (3%);
- **Local Fountain Only – Red Truck**: cost reduction of **$161,079** (34%);
- **Local Fountain Only – White Truck**: cost increase of **$32,216** (22%).

The total cost using the 400-gallon threshold over two years would be $46,462,823, while the recommendation totals $45,691,071. The net result over these two years would be a total savings of $771,752, a 1.7% reduction compared to the original 400-gallon threshold strategy.

These values were calculated based on actual historical delivery volumes.
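The net figures above can be cross-checked with simple arithmetic, using the totals stated in the text:

```r
# Two-year totals as reported above
total_threshold   <- 46462823  # 400-gallon threshold strategy
total_recommended <- 45691071  # recommended fleet assignment

savings <- total_threshold - total_recommended
savings                                    # 771752

round(savings / total_threshold * 100, 1)  # 1.7 (% reduction)
```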
Whether these savings will persist is highly uncertain: economic shifts, customer reactions, competitor strategies, and other factors could all change the picture, and the limited history (only two years of data) adds further uncertainty to any projection.

### 10.2 Impact on Fleet Assignment by Cold Drink Channel

```{r}
# Summarize volume per Cold Drink Channel and NEW_FLEET
data_summary_fleet_channel <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL, NEW_FLEET) %>%
  summarise(Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) +
              sum(QTD_DLV_GAL_2024, na.rm = TRUE) +
              sum(QTD_DLV_CA_2023, na.rm = TRUE) +
              sum(QTD_DLV_CA_2024, na.rm = TRUE),
            .groups = "drop") %>%
  # Filter out the "CONVENTIONAL" channel
  filter(COLD_DRINK_CHANNEL != "CONVENTIONAL") %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  mutate(Percentage = round(Total_Volume / sum(Total_Volume) * 100)) %>%
  ungroup()

# Order channels by total volume
channel_order <- data_summary_fleet_channel %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(Channel_Total = sum(Total_Volume)) %>%
  arrange(Channel_Total) %>%
  pull(COLD_DRINK_CHANNEL)

data_summary_fleet_channel$COLD_DRINK_CHANNEL <- factor(
  data_summary_fleet_channel$COLD_DRINK_CHANNEL,
  levels = channel_order
)

# Plot
ggplot(data_summary_fleet_channel,
       aes(x = Total_Volume, y = COLD_DRINK_CHANNEL, fill = NEW_FLEET)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.7) +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5),
            hjust = 0.2, size = 3.2, color = "black") +
  geom_vline(xintercept = c(2500000, 5000000, 7500000, 10000000),
             color = "lightgray", linewidth = 0.5) +
  scale_x_continuous(labels = comma_format(scale = 1e-6, suffix = "M"),
                     breaks = c(2500000, 5000000, 7500000, 10000000),
                     expand = expansion(mult = c(0, 0.05))) +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  labs(title = "Our Recommendation - Total Volume by Cold Drink Channel",
       x = "Total Volume (in Millions)", y = NULL, fill = "New Fleet Type") +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.text.x = element_text(size = 10),
        legend.position = "bottom",
        legend.box = "horizontal",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
```

With the recommendation, the dining segment saw a 7% reduction in the number of customers previously served by red trucks, who are now served by white trucks. Events and Public Sector each experienced a reduction of nearly 5% in customers served by red trucks, and the remaining segments changed by less than 2%.

The conventional segment is not displayed due to its low volume; the change in this segment was also below 2%.

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# 400-gallon threshold by cold drink channel - number of customers
# Summarize data by COLD_DRINK_CHANNEL and FLEET_TYPE, counting unique customers
data_summary <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL, FLEET_TYPE) %>%
  summarise(Unique_Customers = n_distinct(CUSTOMER_NUMBER), .groups = "drop") %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  mutate(Percentage = round(Unique_Customers / sum(Unique_Customers) * 100, 1)) %>%
  ungroup()

# Create horizontal bar chart
ggplot(data_summary,
       aes(x = Unique_Customers,
           y = reorder(COLD_DRINK_CHANNEL, Unique_Customers),
           fill = FLEET_TYPE)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.7) +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5), hjust = -0.01,
            color = "black", size = 3.2) +
  labs(title = "400 - Percentage of Unique Customers by Cold Drink Channel and Fleet Type",
       x = "Number of Unique Customers", y = NULL) +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 10),
        legend.position = "right",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

# Our recommendation - by number of customers
# Summarize data by COLD_DRINK_CHANNEL and NEW_FLEET, counting unique customers
data_summary <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL, NEW_FLEET) %>%
  summarise(Unique_Customers = n_distinct(CUSTOMER_NUMBER), .groups = "drop") %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  mutate(Percentage = round(Unique_Customers / sum(Unique_Customers) * 100, 1)) %>%
  ungroup()

# Create horizontal bar chart
ggplot(data_summary,
       aes(x = Unique_Customers,
           y = reorder(COLD_DRINK_CHANNEL, Unique_Customers),
           fill = NEW_FLEET)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.7) +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5), hjust = -0.01,
            color = "black", size = 3.2) +
  labs(title = "Our Rec. - Percentage of Unique Customers by Cold Drink Channel and Fleet Type",
       x = "Number of Unique Customers", y = NULL) +
  scale_x_continuous(labels = scales::comma_format()) +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_blank(),
        axis.text.x = element_text(size = 10),
        legend.position = "right",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
```

### 10.3 Impact on Order Types

```{r}
# Merge new fleet assignment onto full_data
full_data <- full_data %>%
  left_join(full_data_customer %>% dplyr::select(CUSTOMER_NUMBER, NEW_FLEET),
            by = "CUSTOMER_NUMBER")

# Summarize by ORDER_TYPE and NEW_FLEET using delivered volume
data_summary_fleet_by_order <- full_data %>%
  filter(!is.na(NEW_FLEET), !is.na(ORDER_TYPE)) %>%
  group_by(ORDER_TYPE, NEW_FLEET) %>%
  summarise(TotalDelivered = sum(DELIVERED_CASES + DELIVERED_GALLONS, na.rm = TRUE),
            .groups = "drop") %>%
  group_by(ORDER_TYPE) %>%
  mutate(Percentage = round(TotalDelivered / sum(TotalDelivered) * 100, 0))

# Order ORDER_TYPE by total delivered volume
order_levels <- data_summary_fleet_by_order %>%
  group_by(ORDER_TYPE) %>%
  summarise(Total = sum(TotalDelivered), .groups = "drop") %>%
  arrange(Total) %>%
  pull(ORDER_TYPE)

# Reorder as factor
data_summary_fleet_by_order$ORDER_TYPE <- factor(data_summary_fleet_by_order$ORDER_TYPE,
                                                 levels = order_levels)

# Plot
ggplot(data_summary_fleet_by_order,
       aes(x = TotalDelivered, y = ORDER_TYPE, fill = NEW_FLEET)) +
  geom_bar(stat = "identity", position = "stack", alpha = 0.6) +
  geom_text(aes(label = paste0(Percentage, "%")),
            position = position_stack(vjust = 0.5), hjust = 0,
            color = "black", size = 3.2) +
  labs(title = "Our Recommendation - Delivered Volume by Order Type",
       x = "Volume (units)", y = NULL, fill = "New Fleet Type") +
  scale_x_continuous(labels = function(x) paste0(x / 1e6, "M"),
                     breaks = c(2500000, 5000000, 7500000, 10000000),
                     expand = expansion(c(0, 0.05))) +
  scale_fill_manual(values = c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")) +
  theme_minimal() +
  theme(plot.title = element_text(size = 10, face = "bold"),
        axis.text.y = element_text(size = 10),
        axis.title.x = element_text(size = 10, face = "plain"),
        legend.position = "right",
        legend.direction = "vertical",
        panel.grid.major.y = element_blank(),
        panel.grid.major.x = element_line(color = "lightgray", size = 0.5),
        panel.grid.minor = element_blank())
```

The key takeaway is that the volume served by sales reps would see only a slight reduction of about 2% compared to the 400-gallon threshold. This avoids abrupt changes that could harm relationships with customers who have closer contact with our sales team.

The most significant shift, however, would occur with orders placed through the call center: approximately 20% of the volume that would have been served by red trucks under the 400-gallon threshold would now be served by white trucks.
This allows red trucks to be redirected to other order types with greater potential to strengthen customer relationships.

### 10.4 Customers Impacted

**All Customers**

```{r}
# Create WHITE_TO_RED:
# Assign 0 if both FLEET_TYPE and NEW_FLEET are "WHITE TRUCK", otherwise assign 1
full_data_customer$WHITE_TO_RED <- ifelse(
  full_data_customer$FLEET_TYPE == "WHITE TRUCK" &
    full_data_customer$NEW_FLEET == "WHITE TRUCK",
  0, 1
)

# Create RED_TO_WHITE:
# Assign 0 if both FLEET_TYPE and NEW_FLEET are "RED TRUCK", otherwise assign 1
full_data_customer$RED_TO_WHITE <- ifelse(
  full_data_customer$FLEET_TYPE == "RED TRUCK" &
    full_data_customer$NEW_FLEET == "RED TRUCK",
  0, 1
)

full_data_customer$CHANGED_FLEET <- ifelse(
  full_data_customer$FLEET_TYPE != full_data_customer$NEW_FLEET,
  "Yes", "No"
)

# Create fleet transition categories
fleet_change_summary <- full_data_customer %>%
  mutate(FLEET_STATUS = case_when(
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
    TRUE ~ "Other"
  )) %>%
  filter(FLEET_STATUS != "Other") %>%
  mutate(FLEET_STATUS = factor(FLEET_STATUS,
                               levels = c("Stayed White", "Stayed Red", "Changed Fleet"))) %>%
  group_by(FLEET_STATUS) %>%
  summarise(Num_Customers = n(), .groups = "drop") %>%
  mutate(Percentage = Num_Customers / sum(Num_Customers) * 100,
         Label = paste0(Num_Customers, " (", round(Percentage, 1), "%)"))

# Custom colors
fleet_colors <- c("Stayed Red" = "#B33951",
                  "Stayed White" = "#D3D3D3",
                  "Changed Fleet" = "plum")

# Plot with value and percentage labels
ggplot(fleet_change_summary,
       aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +
  geom_col(show.legend = FALSE, width = 0.45) +
  geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = fleet_colors) +
  labs(title = "Number of Customers by Fleet Type (400 gal X New Recommendation)",
       x = "", y = "Number of Customers") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(face = "bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
```

Among all customers, compared to the 400-gallon threshold, 14% (4,235) would have their fleet assignment changed, either from red truck to white truck or vice versa. These 4,235 customers represent 9.3% of the total volume sold in 2023 and 2024. Of these, 2,461 would switch from white trucks to red trucks (20% of the white truck volume), while 1,774 would switch from red trucks to white trucks (7.4% of the red truck volume).

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# Create fleet transition categories
fleet_change_summary <- full_data_customer %>%
  mutate(FLEET_STATUS = case_when(
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
    TRUE ~ "Other"
  )) %>%
  filter(FLEET_STATUS != "Other") %>%
  mutate(FLEET_STATUS = factor(FLEET_STATUS,
                               levels = c("Stayed White", "Stayed Red", "Changed Fleet"))) %>%
  group_by(FLEET_STATUS) %>%
  summarise(Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) +
              sum(QTD_DLV_GAL_2024, na.rm = TRUE) +
              sum(QTD_DLV_CA_2023, na.rm = TRUE) +
              sum(QTD_DLV_CA_2024, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(
    # Convert total volume to millions and calculate percentage
    Total_Volume_Million = Total_Volume / 1e6,
    Percentage = Total_Volume_Million / sum(Total_Volume_Million) * 100,
    Label = paste0(round(Total_Volume_Million, 1), "M (", round(Percentage, 1), "%)")
  )

# Custom colors
fleet_colors <- c("Stayed Red" = "#B33951",
                  "Stayed White" = "#D3D3D3",
                  "Changed Fleet" = "plum")

# Plot with volume and percentage labels
ggplot(fleet_change_summary,
       aes(x = FLEET_STATUS, y = Total_Volume_Million, fill = FLEET_STATUS)) +
  geom_col(show.legend = FALSE, width = 0.45) +
  geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = fleet_colors) +
  labs(title = "Volume Change per Fleet type (400 gal X New Rec.)",
       x = "", y = "Total Volume (Million)") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(face = "bold"))

# Define levels to control the bar order
fleet_status_levels <- c("Stayed Red", "Red to White", "Stayed White", "White to Red")

# Prepare the data
fleet_change_data <- full_data_customer %>%
  mutate(
    FLEET_ORIGIN = case_when(
      FLEET_TYPE == "RED TRUCK" ~ "RED_TO_WHITE",
      FLEET_TYPE == "WHITE TRUCK" ~ "WHITE_TO_RED",
      TRUE ~ "Other"
    ),
    FLEET_STATUS = case_when(
      FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Red to White",
      FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
      FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "White to Red",
      FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
      TRUE ~ "Other"
    ),
    FLEET_STATUS = factor(FLEET_STATUS, levels = fleet_status_levels)
  )

# Summarize and calculate percentages
fleet_change_summary <- fleet_change_data %>%
  filter(FLEET_ORIGIN %in% c("RED_TO_WHITE", "WHITE_TO_RED")) %>%
  group_by(FLEET_ORIGIN, FLEET_STATUS) %>%
  summarise(Num_Customers = n(), .groups = "drop") %>%
  group_by(FLEET_ORIGIN) %>%
  mutate(Percentage = Num_Customers / sum(Num_Customers) * 100,
         Label = paste0(Num_Customers, " (", round(Percentage, 1), "%)"))

# Define custom fill colors (no "Other" category)
fleet_colors <- c("Red to White" = "#D3D3D3",
                  "White to Red" = "#B33951",
                  "Stayed Red" = "#B33951",
                  "Stayed White" = "#D3D3D3")

# Plot for Red to White
ggplot(fleet_change_summary %>% filter(FLEET_ORIGIN == "RED_TO_WHITE"),
       aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +
  geom_col(show.legend = FALSE, width = 0.45) +
  geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = fleet_colors) +
  scale_y_continuous(limits = c(0, 25000)) +
  labs(title = "Number of Customers: Red Truck to White Truck",
       x = "", y = "Number of Customers") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(face = "bold"))

# Plot for White to Red
ggplot(fleet_change_summary %>% filter(FLEET_ORIGIN == "WHITE_TO_RED"),
       aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +
  geom_col(show.legend = FALSE, width = 0.45) +
  geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = fleet_colors) +
  scale_y_continuous(limits = c(0, 25000)) +
  labs(title = "Number of Customers: White Truck to Red Truck",
       x = "", y = "Number of Customers") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(face = "bold"))

# Define levels to control the bar order
fleet_status_levels <- c("Stayed Red", "Red to White", "Stayed White", "White to Red")

# Prepare the data
fleet_change_data <- full_data_customer %>%
  mutate(
    FLEET_ORIGIN = case_when(
      FLEET_TYPE == "RED TRUCK" ~ "RED_TO_WHITE",
      FLEET_TYPE == "WHITE TRUCK" ~ "WHITE_TO_RED",
      TRUE ~ "Other"
    ),
    FLEET_STATUS = case_when(
      FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Red to White",
      FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
      FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "White to Red",
      FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
      TRUE ~ "Other"
    ),
    FLEET_STATUS = factor(FLEET_STATUS, levels = fleet_status_levels)
  )

# Summarize and calculate total volume and percentages
fleet_change_summary <- fleet_change_data %>%
  filter(FLEET_ORIGIN %in% c("RED_TO_WHITE", "WHITE_TO_RED")) %>%
  group_by(FLEET_ORIGIN, FLEET_STATUS) %>%
  summarise(Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) +
              sum(QTD_DLV_GAL_2024, na.rm = TRUE) +
              sum(QTD_DLV_CA_2023, na.rm = TRUE) +
              sum(QTD_DLV_CA_2024, na.rm = TRUE),
            .groups = "drop") %>%
  group_by(FLEET_ORIGIN) %>%
  mutate(
    # Convert total volume to millions and calculate percentage
    Total_Volume_Million = Total_Volume / 1e6,
    Percentage = Total_Volume_Million / sum(Total_Volume_Million) * 100,
    Label = paste0(round(Total_Volume_Million, 1), "M (", round(Percentage, 1), "%)")
  )

# Define custom fill colors (no "Other" category)
fleet_colors <- c("Red to White" = "#D3D3D3",
                  "White to Red" = "#B33951",
                  "Stayed Red" = "#B33951",
                  "Stayed White" = "#D3D3D3")

# Plot for Red to White
ggplot(fleet_change_summary %>% filter(FLEET_ORIGIN == "RED_TO_WHITE"),
       aes(x = FLEET_STATUS, y = Total_Volume_Million, fill = FLEET_STATUS)) +
  geom_col(show.legend = FALSE, width = 0.45) +
  geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = fleet_colors) +
  scale_y_continuous(limits = c(0, 30)) +
  labs(title = "Volume: Red Truck to White Truck",
       x = "", y = "Total Volume (Million)") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(face = "bold"))

# Plot for White to Red
ggplot(fleet_change_summary %>% filter(FLEET_ORIGIN == "WHITE_TO_RED"),
       aes(x = FLEET_STATUS, y = Total_Volume_Million, fill = FLEET_STATUS)) +
  geom_col(show.legend = FALSE, width = 0.45) +
  geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = fleet_colors) +
  scale_y_continuous(limits = c(0, 30)) +
  labs(title = "Volume: White Truck to Red Truck",
       x = "", y = "Total Volume (Million)") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(face = "bold"))
```

**Local Market Partners - Local Fountain Only**

```{r}
# Create fleet transition categories for LOCAL_FOUNT_ONLY customers
fleet_change_summary <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>% # Filter only LOCAL_FOUNT_ONLY customers
  mutate(FLEET_STATUS = case_when(
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
    TRUE ~ "Other"
  )) %>%
  filter(FLEET_STATUS != "Other") %>%
  mutate(FLEET_STATUS = factor(FLEET_STATUS,
                               levels = c("Stayed White", "Stayed Red", "Changed Fleet"))) %>%
  group_by(FLEET_STATUS) %>%
  summarise(Num_Customers = n(), .groups = "drop") %>%
  mutate(Percentage = Num_Customers / sum(Num_Customers) * 100,
         Label = paste0(Num_Customers, " (", round(Percentage, 1), "%)"))

# Custom colors
fleet_colors <- c("Stayed Red" = "#B33951",
                  "Stayed White" = "#D3D3D3",
                  "Changed Fleet" = "plum")

# Plot with value and percentage labels
ggplot(fleet_change_summary,
       aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +
  geom_col(show.legend = FALSE, width = 0.45) +
  geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = fleet_colors) +
  labs(title = "LFO Number of Customers (400 gal X New Recommendation)",
       x = "", y = "Number of Customers") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(face = "bold"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())
```

Among local market partners, 148 customers (11%) would switch fleets, accounting for 25% of the group's total volume. Of these, 95 would switch from red trucks to white trucks, which is 52% of this group's red truck customers and 37% of its red truck volume. Additionally, 53 customers would switch from white trucks to red trucks, representing 4.5% of white truck customers and 9% of the white truck volume in this group.

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# Create fleet transition categories for LOCAL_FOUNT_ONLY customers
fleet_change_summary <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>% # Filter only LOCAL_FOUNT_ONLY customers
  mutate(FLEET_STATUS = case_when(
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Changed Fleet",
    FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
    FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
    TRUE ~ "Other"
  )) %>%
  filter(FLEET_STATUS != "Other") %>%
  mutate(FLEET_STATUS = factor(FLEET_STATUS,
                               levels = c("Stayed White", "Stayed Red", "Changed Fleet"))) %>%
  group_by(FLEET_STATUS) %>%
  summarise(Total_Volume = sum(QTD_DLV_GAL_2023, na.rm = TRUE) +
              sum(QTD_DLV_GAL_2024, na.rm = TRUE) +
              sum(QTD_DLV_CA_2023, na.rm = TRUE) +
              sum(QTD_DLV_CA_2024, na.rm = TRUE),
            .groups = "drop") %>%
  mutate(Total_Volume_Million = Total_Volume / 1e6,
         Percentage = Total_Volume_Million / sum(Total_Volume_Million) * 100,
         Label = paste0(round(Total_Volume_Million, 1), "M (", round(Percentage, 1), "%)"))

# Custom colors
fleet_colors <- c("Stayed Red" = "#B33951",
                  "Stayed White" = "#D3D3D3",
                  "Changed Fleet" = "plum")

# Plot with volume and percentage labels for LOCAL_FOUNT_ONLY customers
ggplot(fleet_change_summary,
       aes(x = FLEET_STATUS, y = Total_Volume_Million, fill = FLEET_STATUS)) +
  geom_col(show.legend = FALSE, width = 0.45) +
  geom_text(aes(label = Label), vjust = -0.5, size = 3.5) +
  scale_fill_manual(values = fleet_colors) +
  labs(title = "Volume Change per Fleet type (400 gal X New Rec.)",
       x = "", y = "Total Volume (Million)") +
  theme_minimal() +
  theme(axis.text.x = element_text(size = 10),
        plot.title = element_text(face = "bold"))

# Prepare the data with LOCAL_FOUNT_ONLY filter
fleet_change_data <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>% # Filter only LOCAL_FOUNT_ONLY customers
  mutate(
    FLEET_ORIGIN = case_when(
      FLEET_TYPE == "RED TRUCK" ~ "RED_TO_WHITE",
      FLEET_TYPE == "WHITE TRUCK" ~ "WHITE_TO_RED",
      TRUE ~ "Other"
    ),
    FLEET_STATUS = case_when(
      FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Red to White",
      FLEET_TYPE == "RED TRUCK" & NEW_FLEET == "RED TRUCK" ~ "Stayed Red",
      FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "RED TRUCK" ~ "White to Red",
      FLEET_TYPE == "WHITE TRUCK" & NEW_FLEET == "WHITE TRUCK" ~ "Stayed White",
      TRUE ~ "Other"
    ),
    FLEET_STATUS = factor(FLEET_STATUS, levels = fleet_status_levels)
  )

# Summarize and calculate percentages for LOCAL_FOUNT_ONLY customers
fleet_change_summary <- fleet_change_data %>%
  filter(FLEET_ORIGIN %in% c("RED_TO_WHITE", "WHITE_TO_RED")) %>%
  group_by(FLEET_ORIGIN, FLEET_STATUS) %>%
  summarise(Num_Customers = n(), .groups = "drop") %>%
  group_by(FLEET_ORIGIN) %>%
  mutate(Percentage = Num_Customers
/sum(Num_Customers) *100,Label =paste0(Num_Customers, " (", round(Percentage, 1), "%)") )# Define custom fill colors (no "Other" category)fleet_colors <-c("Red to White"="#D3D3D3","White to Red"="#B33951","Stayed Red"="#B33951","Stayed White"="#D3D3D3")# Plot for Red to White (LOCAL_FOUNT_ONLY customers)ggplot(fleet_change_summary %>%filter(FLEET_ORIGIN =="RED_TO_WHITE"), aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +geom_col(show.legend =FALSE, width =0.45) +geom_text(aes(label = Label), vjust =-0.5, size =3.5) +scale_fill_manual(values = fleet_colors) +scale_y_continuous(limits =c(0, 25000)) +labs(title ="Number of Customers: Red Truck to White Truck",x ="",y ="Number of Customers" ) +theme_minimal() +theme(axis.text.x =element_text(size =10),plot.title =element_text(face ="bold") )# Plot for White to Red (LOCAL_FOUNT_ONLY customers)ggplot(fleet_change_summary %>%filter(FLEET_ORIGIN =="WHITE_TO_RED"), aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +geom_col(show.legend =FALSE, width =0.45) +geom_text(aes(label = Label), vjust =-0.5, size =3.5) +scale_fill_manual(values = fleet_colors) +scale_y_continuous(limits =c(0, 25000)) +labs(title ="Number of Customers: White Truck to Red Truck",x ="",y ="Number of Customers" ) +theme_minimal() +theme(axis.text.x =element_text(size =10),plot.title =element_text(face ="bold") )# Summarize and calculate total volume and percentages for LOCAL_FOUNT_ONLY customersfleet_change_summary <- fleet_change_data %>%filter(FLEET_ORIGIN %in%c("RED_TO_WHITE", "WHITE_TO_RED")) %>%group_by(FLEET_ORIGIN, FLEET_STATUS) %>%summarise(Total_Volume =sum(QTD_DLV_GAL_2023, na.rm =TRUE) +sum(QTD_DLV_GAL_2024, na.rm =TRUE) +sum(QTD_DLV_CA_2023, na.rm =TRUE) +sum(QTD_DLV_CA_2024, na.rm =TRUE),.groups ="drop" ) %>%group_by(FLEET_ORIGIN) %>%mutate(Total_Volume_Million = Total_Volume /1e6,Percentage = Total_Volume_Million /sum(Total_Volume_Million) *100,Label =paste0(round(Total_Volume_Million, 1), "M (", 
round(Percentage, 1), "%)") )# Plot for Red to White (Volume) - LOCAL_FOUNT_ONLY customersggplot(fleet_change_summary %>%filter(FLEET_ORIGIN =="RED_TO_WHITE"), aes(x = FLEET_STATUS, y = Total_Volume_Million, fill = FLEET_STATUS)) +geom_col(show.legend =FALSE, width =0.45) +geom_text(aes(label = Label), vjust =-0.5, size =3.5) +scale_fill_manual(values = fleet_colors) +scale_y_continuous(limits =c(0, 30)) +labs(title ="Volume: Red Truck to White Truck",x ="",y ="Total Volume (Million)" ) +theme_minimal() +theme(axis.text.x =element_text(size =10),plot.title =element_text(face ="bold") )# Plot for White to Red (Volume) - LOCAL_FOUNT_ONLY customersggplot(fleet_change_summary %>%filter(FLEET_ORIGIN =="WHITE_TO_RED"), aes(x = FLEET_STATUS, y = Total_Volume_Million, fill = FLEET_STATUS)) +geom_col(show.legend =FALSE, width =0.45) +geom_text(aes(label = Label), vjust =-0.5, size =3.5) +scale_fill_manual(values = fleet_colors) +scale_y_continuous(limits =c(0, 30)) +labs(title ="Volume: White Truck to Red Truck",x ="",y ="Total Volume (Million)" ) +theme_minimal() +theme(axis.text.x =element_text(size =10),plot.title =element_text(face ="bold") )```**Impacts on Chain Members**```{r}# Define levels to control the bar orderfleet_status_levels <-c("Stayed Red", "Red to White", "Stayed White", "White to Red")# Prepare the data with CHAIN_MEMBER includedfleet_change_data <- full_data_customer %>%mutate(FLEET_ORIGIN =case_when( FLEET_TYPE =="RED TRUCK"~"RED_TO_WHITE", FLEET_TYPE =="WHITE TRUCK"~"WHITE_TO_RED",TRUE~"Other" ),FLEET_STATUS =case_when( FLEET_TYPE =="RED TRUCK"& NEW_FLEET =="WHITE TRUCK"~"Red to White", FLEET_TYPE =="RED TRUCK"& NEW_FLEET =="RED TRUCK"~"Stayed Red", FLEET_TYPE =="WHITE TRUCK"& NEW_FLEET =="RED TRUCK"~"White to Red", FLEET_TYPE =="WHITE TRUCK"& NEW_FLEET =="WHITE TRUCK"~"Stayed White",TRUE~"Other" ),FLEET_STATUS =factor(FLEET_STATUS, levels = fleet_status_levels),CHAIN_MEMBER =as.factor(CHAIN_MEMBER) )# Summarize with CHAIN_MEMBERfleet_change_summary 
<- fleet_change_data %>%filter(FLEET_ORIGIN %in%c("RED_TO_WHITE", "WHITE_TO_RED")) %>%group_by(FLEET_ORIGIN, CHAIN_MEMBER, FLEET_STATUS) %>%summarise(Num_Customers =n(), .groups ="drop") %>%group_by(FLEET_ORIGIN, CHAIN_MEMBER) %>%mutate(Percentage = Num_Customers /sum(Num_Customers) *100,Label =paste0(Num_Customers, " (", round(Percentage, 1), "%)") )# Define custom colorsfleet_colors <-c("Red to White"="#D3D3D3","White to Red"="#B33951","Stayed Red"="#B33951","Stayed White"="#D3D3D3")# Plot functionplot_fleet_change <-function(origin) {ggplot(fleet_change_summary %>%filter(FLEET_ORIGIN == origin), aes(x = FLEET_STATUS, y = Num_Customers, fill = FLEET_STATUS)) +geom_col(show.legend =FALSE, width =0.45) +geom_text(aes(label = Label), vjust =-0.5, size =3.5) +scale_fill_manual(values = fleet_colors) +facet_wrap(~ CHAIN_MEMBER, labeller = label_both) +scale_y_continuous(limits =c(0, 15000)) +# Y scale set to 0–15000labs(title =paste("Fleet Transition:", gsub("_", " ", origin)),x ="",y ="Number of Customers" ) +theme_minimal() +theme(axis.text.x =element_text(size =10),plot.title =element_text(face ="bold"),panel.grid.major.x =element_blank(), # Remove vertical grid linespanel.grid.minor.x =element_blank() )}# Plotsplot_fleet_change("RED_TO_WHITE")plot_fleet_change("WHITE_TO_RED")```Among customers who are chain members (CHAIN_MEMBER = 1) and who, based on the 400-gallon threshold, should be served by red trucks, 17% would now be served by white trucks instead. This shift raises the question of whether there could be a negative impact due to the inconsistent service model within the same customer group.In parallel, 16% of customers who should be served by white trucks under the same threshold would now be served by red trucks. 
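The baseline against which these reassignments are compared is the 400-gallon annual volume cutoff. A minimal sketch of that rule (the function name, vectorized form, and the handling of exactly 400 gallons are assumptions for illustration, not the project's actual implementation):

```r
# Illustrative sketch of the 400-gallon baseline rule (assumed helper):
# customers at or above the annual threshold go to the company's own
# fleet (red trucks); the rest go to ARTM (white trucks).
assign_fleet <- function(avg_annual_consump, threshold = 400) {
  ifelse(avg_annual_consump >= threshold, "RED TRUCK", "WHITE TRUCK")
}

assign_fleet(c(950, 120))  # "RED TRUCK" "WHITE TRUCK"
```

Divergences like the 17% and 16% above arise wherever the recommended assignment (NEW_FLEET) overrides this raw volume rule.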
This inversion in fleet assignment suggests a possible misalignment with the intended operational segmentation and should be evaluated further to ensure that customer experience and operational efficiency are not compromised.

### 10.5 Impact on Customer Segments (clusters)

Below is a visualization of the customers in each cluster who would change their fleet assignment, based on their consumption and number of orders.

```{r, warning=FALSE}
# Define custom colors for the fleet and clusters
palette_fleet <- c(
  "RED TRUCK" = "#B33951",   # Red truck
  "WHITE TRUCK" = "#D3D3D3"  # White truck
)

palette_clusters <- c(
  "Cluster 1" = "#FF6347",
  "Cluster 2" = "#4682B4",
  "Cluster 3" = "#FFD700"
)

# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Filter only customers with CHANGED_FLEET == "Yes"
changed_fleet_data <- full_data_customer %>%
  filter(CHANGED_FLEET == "Yes") %>%
  filter(NEW_FLEET %in% c("RED TRUCK", "WHITE TRUCK")) %>%
  mutate(Fleet_Type = NEW_FLEET)

# Define a custom labeller for the clusters
custom_labeller <- labeller(CLUSTER = c(
  "1" = "Cluster 1",
  "2" = "Cluster 2",
  "3" = "Cluster 3"
))

# Create scatter plot with log scales and no background color for facet labels
ggplot(changed_fleet_data) +
  geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = Fleet_Type),
              alpha = 0.5, width = 0.2) +  # Jitter to avoid overplotting
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +  # Log scale for x-axis with specific limits
  scale_y_log10(
    limits = c(10, 1000000),
    breaks = c(10, 100, 1000, 10000, 100000, 1000000),
    labels = scales::comma
  ) +
  scale_color_manual(values = palette_fleet) +
  scale_linetype_manual(values = "solid", name = "") +
  labs(
    title = "Customers who changed truck assignments by Cluster",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ CLUSTER + Fleet_Type, scales = "fixed", nrow = 1, labeller = custom_labeller) +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 9),  # Adjust text size for facet labels
    strip.background = element_blank(),   # Remove background color from facet labels
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "bottom",
    legend.box = "vertical"
  ) +
  guides(color = "none")  # Remove legend for Fleet_Type
```

Out of the 4,235 customers who would change their fleet assignment (representing 9.3% of the total volume):

- Cluster breakdown:
  - 1,273 customers from Cluster 1 would now be served by red trucks.
  - 1,188 customers from Cluster 2 would switch from white trucks to red trucks, reflecting high potential, recency, and order frequency.
- Additionally:
  - 1,748 customers from Cluster 2 would switch from red trucks to white trucks.
  - 26 customers from Cluster 3 would switch from red trucks to white trucks.

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
# Summarize CHANGED_FLEET by CLUSTER
full_data_customer %>%
  group_by(CLUSTER, CHANGED_FLEET) %>%
  summarise(Count = n(), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = CHANGED_FLEET, values_from = Count, values_fill = 0)

full_data_customer %>%
  filter(CHANGED_FLEET == "Yes") %>%
  group_by(CLUSTER, FLEET_TYPE, NEW_FLEET) %>%
  summarise(Count = n(), .groups = "drop")
```

**Customer Segmentation and Cold Drink Channel**

Among the customers who would change fleet assignments, the majority belong to the Dining segment (52%), where 962 would switch from red trucks to white trucks and 1,229 would switch from white trucks to red trucks.
The second-largest segment with changes is GOODS (19%), where 251 customers would switch from red trucks to white trucks and 549 would switch from white trucks to red trucks. The EVENT segment (9%) would have 226 customers switching from red trucks to white trucks and 159 switching from white trucks to red trucks.

```{r, warning=FALSE}
# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Define colors for NEW_FLEET
fleet_colors <- c("RED TRUCK" = "#B33951", "WHITE TRUCK" = "#D3D3D3")

# Filter data for changed fleet
filtered_data <- full_data_customer %>%
  filter(CHANGED_FLEET == "Yes")

# Calculate the number of unique customers
unique_customers <- filtered_data %>%
  summarise(unique_customers = n_distinct(CUSTOMER_NUMBER))
# print(unique_customers)

# Create scatter plot colored by NEW_FLEET
ggplot(filtered_data) +
  geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = NEW_FLEET),
              alpha = 0.5, width = 0.2) +
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +
  scale_y_log10(
    limits = c(10, 100000),
    breaks = c(10, 100, 1000, 10000, 100000),
    labels = scales::comma
  ) +
  scale_color_manual(values = fleet_colors) +
  scale_linetype_manual(values = "solid", name = NULL) +
  labs(
    title = "Customers Who Changed Fleet Assignment",
    subtitle = "Average Annual Consumption vs. Number of Orders by Cold Drink Channel",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ COLD_DRINK_CHANNEL, scales = "fixed") +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 10),
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "bottom"
  )
```

```{r, echo=FALSE, results='hide', message=FALSE, warning=FALSE, fig.show='hide'}
full_data_customer %>%
  filter(CHANGED_FLEET == "Yes") %>%
  group_by(COLD_DRINK_CHANNEL, FLEET_TYPE, NEW_FLEET) %>%
  summarise(Count = n(), .groups = "drop")

# Define the custom color palette for COLD_DRINK_CHANNEL with unique colors
cold_drink_channel_colors <- c(
  "DINING" = "#A7ADC6",
  "PUBLIC SECTOR" = "#FF6347",
  "EVENT" = "#B33951",
  "WORKPLACE" = "#ABD2FA",
  "ACCOMMODATION" = "#E377C2",
  "GOODS" = "#FFD700",
  "BULK TRADE" = "#8ED081",
  "WELLNESS" = "#20B2AA",
  "CONVENTIONAL" = "#1F77B4"
)

# Create a data frame for the threshold line
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Create scatter plot with log scales for all clusters combined
ggplot(full_data_customer) +
  geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP,
                  color = as.factor(COLD_DRINK_CHANNEL)),
              alpha = 0.5, width = 0.2) +  # Jitter to avoid overplotting
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +  # Log scale for x-axis with specific limits (1 -> 500)
  scale_y_log10(
    limits = c(10, 100000),
    breaks = c(10, 100, 1000, 10000, 100000),
    labels = scales::comma  # Format numbers with commas
  ) +
  scale_color_manual(values = cold_drink_channel_colors) +  # Apply custom colors
  scale_linetype_manual(values = "solid", name = NULL) +  # Remove title from legend
  labs(
    title = "Annual Consumption vs. Number of Orders by Cold Drink Channel",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)"
  ) +
  facet_wrap(~ COLD_DRINK_CHANNEL, scales = "fixed") +  # Facet by COLD_DRINK_CHANNEL only
  guides(color = "none") +  # Remove legend for COLD_DRINK_CHANNEL
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 10),
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "bottom"
  )

# Count the number of unique customers
# num_unique_customers <- length(unique(full_data_customer$CUSTOMER_NUMBER))
# num_unique_customers

# Table with counts and percentages of customers who changed fleet assignment
# by cold drink channel
fleet_summary <- full_data_customer %>%
  filter(CHANGED_FLEET == "Yes") %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(Count = n(), .groups = "drop") %>%
  mutate(Percentage = round(Count / sum(Count) * 100, 1))

# Display the summary table
print(fleet_summary)

# Total number of customers in full_data_customer by channel
total_customers_by_channel <- full_data_customer %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(Total_Customers = n(), .groups = "drop")

# Number of customers who changed fleet by channel
changed_fleet_summary <- full_data_customer %>%
  filter(CHANGED_FLEET == "Yes") %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(Changed_Count = n(), .groups = "drop")

# Join and calculate percentages
fleet_summary <- changed_fleet_summary %>%
  left_join(total_customers_by_channel, by = "COLD_DRINK_CHANNEL") %>%
  mutate(
    Percent_Among_Changed = round(Changed_Count / sum(Changed_Count) * 100, 1),
    Percent_of_Total_Channel = round(Changed_Count / Total_Customers * 100, 1)
  ) %>%
  arrange(desc(Changed_Count))

# Display the summary table
print(fleet_summary)

# Total number of LOCAL_FOUNT_ONLY == 1 by channel
total_local_fount <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1) %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(Total_Local_Fount = n(), .groups = "drop")

# Number of LOCAL_FOUNT_ONLY == 1 and CHANGED_FLEET == "Yes" by channel
changed_local_fount <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1, CHANGED_FLEET == "Yes") %>%
  group_by(COLD_DRINK_CHANNEL) %>%
  summarise(Changed_Local_Fount = n(), .groups = "drop")

# Join and calculate percentages
local_fount_summary <- changed_local_fount %>%
  left_join(total_local_fount, by = "COLD_DRINK_CHANNEL") %>%
  mutate(
    Percent_Among_Changed = round(Changed_Local_Fount / sum(Changed_Local_Fount) * 100, 1),
    Percent_of_Total_Channel = round(Changed_Local_Fount / Total_Local_Fount * 100, 1)
  ) %>%
  arrange(desc(Changed_Local_Fount))

# Display the summary table
print(local_fount_summary)

full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1, CHANGED_FLEET == "Yes") %>%
  group_by(COLD_DRINK_CHANNEL, FLEET_TYPE, NEW_FLEET) %>%
  summarise(Count = n(), .groups = "drop")
```

```{r}
# Define colors based on NEW_FLEET
fleet_colors <- c(
  "RED TRUCK" = "#B33951",   # Red
  "WHITE TRUCK" = "#D3D3D3"  # Light gray
)

# Filter the data for LOCAL_FOUNT_ONLY == 1 and CHANGED_FLEET == "Yes"
filtered_local_fount <- full_data_customer %>%
  filter(LOCAL_FOUNT_ONLY == 1, CHANGED_FLEET == "Yes")

# Threshold line for 400 gallons
threshold_line <- data.frame(
  x = c(1, 500),
  y = c(400, 400),
  type = "400 Gallons Threshold"
)

# Updated plot
ggplot(filtered_local_fount) +
  geom_jitter(aes(x = NUM_ORDERS, y = AVG_ANNUAL_CONSUMP, color = NEW_FLEET),
              alpha = 0.6, width = 0.2) +
  geom_line(data = threshold_line, aes(x = x, y = y, linetype = type),
            color = "red", size = 1) +
  scale_x_log10(limits = c(1, 500)) +
  scale_y_log10(
    limits = c(10, 100000),
    breaks = c(10, 100, 1000, 10000, 100000),
    labels = scales::comma
  ) +
  scale_color_manual(values = fleet_colors) +
  scale_linetype_manual(values = "solid", name = NULL) +
  labs(
    title = "Local Fountain Only Customers Who Changed Fleet Assignment",
    subtitle = "Average Annual Consumption vs. Number of Orders by Cold Drink Channel",
    x = "Number of Orders (Log Scale)",
    y = "Average Annual Consumption (Log Scale)",
    color = "New Fleet"
  ) +
  facet_wrap(~ COLD_DRINK_CHANNEL, scales = "fixed") +
  theme_minimal() +
  theme(
    text = element_text(size = 12),
    axis.text.x = element_text(size = 10),
    axis.text.y = element_text(size = 10),
    strip.text = element_text(size = 10),
    panel.grid.major.y = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.major.x = element_line(color = "gray90", linetype = "solid", size = 0.3),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "white", color = "white"),
    legend.position = "bottom"
  )

# List all variables in the environment
all_vars <- ls()

# Exclude 'full_data', 'full_data_customer', and the new variables from removal
vars_to_keep <- c("full_data", "full_data_customer", "mydir", "one_seed", "reference_date")

# Get the variables to remove
vars_to_remove <- setdiff(all_vars, vars_to_keep)

# Remove the temporary data frames
rm(list = vars_to_remove)

# Clean up by removing 'all_vars' and 'vars_to_remove'
rm(all_vars, vars_to_remove)
```

Among the local market partners with fountain drinks only, nearly 90% of the fleet changes would occur in the Dining segment. In this group, 85 customers would switch from white trucks to red trucks and 45 would switch from red trucks to white trucks.

## 11. Business Value and Final Conclusions

Applied retrospectively to the past two years of data, the proposed fleet reassignment strategy would have saved the company approximately $770,000 by increasing the number of customers served by red trucks, optimizing their usage frequency, and reducing red truck delivery volume by 3%, freeing capacity for eventual redeployment to strategic customers.

The proposal is quite conservative, redesigning the delivery method for only 14% of customers, and it assigns the fleet based not only on volume but also on several intrinsic customer characteristics. The expectation, therefore, is that implementation will bring gains not only in cost reduction but also in sales, mainly among customers with greater growth potential. In addition, the analysis identified three main customer groups, two of which showed good homogeneity.

When measuring the impacts of the new fleet assignment, the Dining segment was the most affected by the changes, particularly among the local market partners classified as fountain only. There was no significant impact on the activities of sales representatives, but there was a significant reduction (-20%) in the volumes delivered by red trucks for orders placed through call centers. This is actually a good outcome, since call center orders no longer reflected a strong customer relationship.

A differentiator for the process was the feature engineering, which brought robustness to the clustering. The supervised models, Decision Tree and Multinomial Logistic Regression, were very important in explaining the variables that influenced the clusters, and with accuracy close to 90%, they have the potential to predict segments for new customers.

**Limitations, Improvements, and Lessons**

One of the main limitations of this project was the short, two-year historical series, which made it difficult to predict the future impact of the recommendation.
Forecasting approaches were challenging because the resulting probability ranges were so wide that almost any outcome was plausible.

Another challenge was the asynchrony between customer orders, which made it hard to track individual customer growth tied to specific times of the year. With a longer historical series, we could have made more accurate predictions.

The census data could also have been better utilized. The way it was applied in this project did not deliver the expected results, but with adjustments and more historical data, it could provide valuable insights for future analysis.

It is clear that predicting future growth, even with extensive data, is a complex task. Such predictions should only be emphasized when the process is robust, with strong statistical support and a consistent range of possible outcomes; otherwise, it may be better to refrain from highlighting them.

Looking ahead, I strongly recommend conducting further tests to measure the impact of fleet allocation and of the way customers place orders. This will be crucial in validating or refining the current approach. Additionally, analyzing revenue could provide deeper business insights, especially for understanding margins across different customer segments.

A key takeaway from this project is that data does not always provide all the answers needed for decision-making. In these cases, history shows that there will be both successes and setbacks, but decisions still need to be made. My role was to make responsible recommendations and take a clear stance, even when faced with uncertainty.